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We present a procedure to solve the inverse Ising problem, that is to find the interactions between 

- - - a set of binary variables from the measure of their equilibrium correlations. The method consists 

' in constructing and selecting specific clusters of variables, based on their contributions to the cross- 

, entropy of the Ising model. Small contributions are discarded to avoid overfitting and to make the 

^ ' computation tractable. The properties of the cluster expansion and its performances on synthetic 

' data are studied. To make the implementation easier we give the pseudo-code of the algorithm. 
> ' 

^ : 

O ; I. INTRODUCTION 

in ■ 

The Ising model is a paradigm of statistical physics, and has been extensively studied to understand the equilibrium 
properties and the nature of the phase transitions in various systems in condensed matter [l| . In its usual formulation, 
P5 the Ising model is defined over a set of N binary variables cr^, with i = 1,2, . . . , N . The variables, called spins, are 
^ . submitted to a set of N local fields, hi, and of ^N{N — 1) pairwise couplings, Jij. The observables of the model, such 
' as the average values of the spins or of the spin-spin correlations over the Gibbs measure, 
', 

T3 . {(Jt) , (a-kO-i) , (1) 

d • are well-defined and can be calculated from the knowledge of those interaction parameters. We will refer to the task 
^ ' of calculating ([T]) given the interaction parameters as to the direct Ising problem. 

I , In many experimental cases, the interaction parameters are unknown, while the values of observables can be 
' O ■ estimated from measurements. A natural question is to know if and how the interaction parameters can be deduced 
C from the data (043)- When the coupling matrix is known a priori to have a specific and simple structure, this question 
can be answered with an ordinary fit. For instance, in a two-dimensional and uniform ferromagnet, all couplings vanish 
but between neighbors on the lattice, and Jij = J for contiguous sites i and j. In such a case, the observable such as 
the average correlation between neighboring spins, c, depends on a single parameter, J. The measurement of c gives 
a direct access to a value of J. However, data coming from complex systems arising in biology, sociology, finance, ... 
can generally not be interpreted with such a simple Ising model, and the fit procedure is much more complicated for 
two reasons. First, in the absence of any prior knowledge about the interaction network, the number of interaction 
, parameters Jij to be inferred scales quadratically with the system size N, and can be very large. Secondly, the quality 
l/^ ' of the data is a crucial issue. Experimental data are plagued by noise, coming either from the measurement apparatus 
■ or from imperfect sampling. The task of fitting a very large number of interaction parameters from 'noisy' data has 
I received much attention in the statistics community, under the name of high-dimensional inference (l3j . 

To be more specific, the inverse Ising problem is defined as follows. Assume that a set of B configurations cr"^ = 
T-H . Wi: I • ■ ■ I '^Tvi' with r — 1,2, . . . , B are available from measurements. We compute the empirical 1- and 2-point 
' averages through 

p^ = :^E<' Pki = ^j2<yiar . (2) 
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The inverse Ising problem consists in finding the values of the N local fields, hi, and of the ^N{N — 1) interactions, 
Jij , such that the individual and pairwise frequencies of the spins ([T]) defined from the Gibbs measure coincide with 
their empirical counterparts, pi and pki- While the Gibbs measure corresponding to the Ising model is by no means 
the unique measure allowing one to reproduce the data pi and pki , it is the distribution with the largest entropy doing 
so 0. In other words, the Ising model is the least constrained model capable of matching the empirical values of 
the 1- and 2-point observables. This property explains the recent surge of interest in defining and solving the inverse 
Ising problem in the context of the analysis of biological, e.g. neurobiological 0-11, [13] and proteomic [5,, j|] data. 

As a result of its generality, the inverse Ising problem has been studied in various fields under different names, 
such as Boltzmann machine learning in learning theory [ill . [l2j or graphical model selection in statistical inference 
[3 [3 13 • While the research field is currently very active, the diversity of the tools and, sometimes, of the goals 
make somewhat difficult to compare the results obtained across the disciplines. Several variants of the inverse Ising 
problem can be defined: 
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• A: find the interaction network from a set of spin configurations cr"^. It is generally assumed in the graphical 
model community that the Ising model is exact, that is, that the underlying distribution of the data is truly an 
Ising model with unknown interaction parameters J. The question is to find which interactions Jij are non zero 
(or larger than some Jmin is absolute value), and how many configurations (value of B) should be sampled to 
achieve this goal with acceptable probability. 

• B: find the interactions Jij and the fields hi from the frequencies Pi,Pij only. Those frequencies should be 
reproduced within a prescribed accuracy, e, not too small (compared to the error on the data) to avoid overfitting. 
Note that in general the Ising model is not the true underlying model for the data here; it is only the model 
with maximal entropy given the constraints on 1- and 2-point correlations. 

• C: same as B, but in addition we want to know the entropy (at fixed individual and pairwise frequencies), which 
measures how many configurations cr really contribute to the Gibbs distribution of the Ising model. Computing 
the entropy is generally intractable for the direct Ising problem, unless correlations decay fast enough [l7| . 

Variants B and C are harder than A: full spin configurations give access to all -ftT-spin correlations, a knowledge which 
can be used to design fast network structure inference algorithm. Recently, a procedure to solve problem C was 
proposed, based on ideas and techniques coming from statistical physics [8]. The purpose of the present paper is to 
discuss its performances and limitations. 

It is essential to be aware of the presence of noise in the data, e.g. due to the imperfect sampling (finite number 
B of configurations). A potential risk is overfitting: the network of interactions we find at the end of the inference 
process could reproduce the mere noisy data, rather than the 'true' interactions. How can one disentangle noise from 
signal in the data? A popular approach in the statistics community is to require that the inferred interaction network 
be sparse. The rationale for imposing sparsity is two-fold. First, physical lattices are very sparse, and connect only 
close sites in the space; it is possible but not at all obvious that networks modeling other e.g. biological data enjoy 
a similar property. Secondly, an Ising model with a sparse interaction network reproducing a set of correlations is 
a sparing representation of the statistics of the data, and, in much the same spirit as the minimal message length 
approach [ij], should be preferred to models with denser networks. The appeal of the approach is largely due to the 
fact that imposing sparsity is computationally tractable. 

The criterion required by our procedure is not that the interaction network should be sparse, but that the inverse 
Ising problem should be well-conditioned. To illustrate this notion, consider a set of data, i.e. of frequencies pi,pki, 
and assume one has found the solution hi , Jm to the corresponding inverse Ising problem. Let us now slightly modify 
one or a few frequencies, say, pi2 — > p'12 = P12 + ^Pi2, and solve again the corresponding inverse Ising problem, with 
the results ft,'-, J^^. Let SJki = J'f.i — Ju and 5hi = h[ — hi measure the response of the interaction parameters to the 
small modification of pi2 alone. Two extreme cases are: 

• Localized response: the response is restricted to the parameters involving spins 1 and 2 only, i.e. 5hi, dh2, (5 J12 7^ 
0; it vanishes for all the other parameters. 

• Extended response: the response spreads all over the spin system, and all the quantities 5hi, SJki are non-zero. 

Intermediate cases will generically be encountered, and are symbolized in Fig. [lja)&(b). For instance, if the response 
is non-zero over a small number of parameters only, which define a 'neighborhood' of the spins 1, 2, we will consider 
it is localized. Obviously, the notion of 'smallness' cannot be rigorously defined here, unless the system size N can be 
made arbitrarily large and sent to infinity. 

Drawing our inspiration from the vocabulary of numerical analysis, we will say that the inverse Ising problem is 
well-conditioned if the response is localized. For a well-conditioned problem, a small change of one or a few variables 
essentially affects one or a few interaction parameters. On the contrary, most if not all interaction parameters of 
a ill-conditioned inverse Ising problem are affected by an elementary modification of the data. This notion must 
be distinguished from the concept of ill-posed problem. As we will see in Section |TT1 the inverse Ising problem is 
always well-posed, once an appropriate regularization is introduced: given the frequencies, there exists a unique set 
of interaction parameters reproducing those data, regardless of how hard it is to compute. 

Not all inverse Ising problems are well-conditioned. However, it is our opinion that only those ones should be 
solved. The reason is that, in generic experimental situations, only a (small) region of the system is accessible. 
Solving the inverse problem attached to this sub-system makes sense only if the problem is well-conditioned. If it is 
ill-conditioned, extending even by a bit the sub-system would considerably affect the values of most of the inferred 
parameters (Fig. HIc)). Hence, the interaction parameters would be very much dependent on the part of the system 
which is not measured! Such a possibility simply means that the inverse problem, though mathematically well-posed, 
is not meaningful. 
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FIG. 1: Schematic representation of a well-conditioned (a) and an ill-conditioned (b) inverse Ising problems. The gray areas 
symbolize the set of spins (full dots) whose interactions (links) and fields are affected by a change of the frequency pi2 of 
spins 1 and 2. The response is localized in (a) and extended in (b). Experiments usually measure a restricted part of the 
system (dashed contour) only (c). Increasing the size of the measured sub-system, e.g. by including the frequencies of the 
extra-variables 3 and 4, will modify most of the inferred interaction parameters if the problem is ill-conditioned. 



Interestingly, the response of the interactions to a change of a few correlations can be localized, while the response 
of the correlations to a change of a few interactions is extended. An example is given by 'critical' Ising models, where 
correlations extend over the whole system. However, the corresponding inverse Ising problem may be well-conditioned. 

The presence of noise in the data considerably affects the status of the inverse Ising problem. As we will see later, 
even well-conditioned problems in the limit of perfect sampling (i? — >■ oo) become ill-conditioned as soon as sampling 
is imperfect (finite B). The same statement holds for the sparsity-based criterion mentioned above: when data are 
generated by a sparse interaction network, the solution to the inverse Ising model is not sparse as a consequence of 
imperfect sampling. Only the presence of an explicit and additional regularization forces the solution to be sparse. 
In much the same way, the procedure we present hereafter builds a well-conditioned inverse Ising problem, which 
prevents overfitting of the noise. This procedure is based on the expansion of the entropy at fixed frequencies in 
clusters of spins, a notion closely related to the neighborhoods appearing in the localized responses. 

The plan of the article is as follows. In Section [ll] we give the notations and precise definitions of the inverse 
Ising problem, and briefiy review some of the resolution procedures in the literature. In Section |III[ we explain how 
the entropy can be expanded as a sum of contributions, one for each cluster (or sub-set) of spins, and review the 
properties of those entropic contributions. The procedure to truncate the expansion and keep only relevant clusters 
is discussed in Section |lVl The pseudo-codes and details necessary for the implementation of the algorithm can be 
found in Section |Vl Applications to artificial data are discussed at length in Section IVll Finally, Section IVlII presents 
some perspectives and conclusions. To improve the readability of the paper most technical details have been relegated 
to technical appendices. 

II. THE INVERSE ISING PROBLEM: FORMULATIONS AND ISSUES 

A. Maximum Entropy Principle formulation 

We consider a system of N binary variables, ai = 0,1, where i — 1,2,. ..,A^. The average values of the variables, pi, 
and of their correlations, pki, are measured, for instance through the empirical average over B sampled configurations 
of the system, see equations As the correlations pki are obtained from the empirical measure, the problem is 
realizable [ll|. Let p = {pi,Pki} denote the data. The Maximum Entropy Principle (MEP) [9j] postulates that the 
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probabilistic model P{(t) should maximize the entropy S of the distribution P under the constraints 

^P(cr) = l, ^ P((t) = p, , ^P{a) ak(7i ^ Pki ■ 



(3) 



In practice these constraints are enforced by the Lagrange multipliers A and J = {hi,Jki}. The maximal entropy 
is 1501 



S(p) = min max 

A,J P(cr) 



Y,P{a)logP{a) + X\^P(cT)-lj 

i \ cr / k<l \ cr 



)<yk<yi- Pki 



(4) 



The maximization condition over P shows that the MEP probability corresponds to the Gibbs measure Pj of the 
celebrated Ising model, 



-Hj.x„.,[<t|J] 



Z[3] 



where the energy function is 



ff/smc/[c|J] = — E ''''' ~ E 



(5) 



(6) 



k<l 



and Z[3] — exp(— i7/5j„g[(T|J]) denotes the partition function. The values of the couplings and fields 5l| are then 
found through the minimization of 



Slszng[J\p] = logZ[J] - E ^ E 



Pkl ■ 



(7) 



k<l 



over J. The minimal value of Sjsing coincides with S defined in 

The cross-entropy Sising has a simple interpretation in terms of the KuUback-Leibler divergence between the Ising 
distribution Pj[<t] and the empirical measure over the observed configurations, Pobsi""]- Assume B configurations of 
the N variables, cr'^, with r = 1, 2, . . . , _B, are sampled. We define the empirical distribution through 



1 

Pohs[(T] = ^E^'=^''^" ' 

T=l 

where 6 denotes the iV-dimensional Kronecker delta function. It is easy to check from ([7]) that 

SIs^ng[J\p] = - E [t] log Pj [ct] = - ^ Pobs [<t] log Pobs [t] + i?(Pobs||Pj) , 



(8) 



(9) 



where D denotes the KL-divergence. Hence, the minimization procedure over J ensures that the 'best' Ising measure 
(as close as possible to the empirical measure) is found. 



B. Regularization and Bayesian formulation 

We consider the Hessian of the cross-entropy Si sing ^ also called Fisher information matrix, which is a matrix of 
dimension ^N{N + 1), defined through 

^ d'^ Si sing ^ / Xi,i' Xi,k'l' \ MQ") 

8383 \ Xki,i' Xki.k'i' ) ' 

The entries of x are obtained upon repeated differentiations of the partition function Z[3], and can be expressed in 
terms of averages over the Ising Gibbs measure 

Xi,i' = (o-iCri')j - ((Tj)j(cri')j , (11) 
Xi,k'V = {<7iCrk'<7l')j-{(Ti)3{o'k'Cri')j, 
Xkl,k'l' — {<7k(^l(^k'<7l')j~{<7kCri)j{crk'<7l')j- 
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Consider now an arbitrary ^N{N + l)-diniensional vector x = {xi, x^i}- The quadratic form 

x''' • X • X = M ^a;j(CTi - (cri)j) + ^a;fci((Tfc(Ti - ((TfeCT;)j) I ) (12) 
\ \ i k<i / / J 

is semi-definite positive. Hence, Sjsing is a convex function. 

However the minimum is not guaranteed to be unique if x has zero modes, nor to be finite. To circumvent those 
difficulties, one can 'regularize' the cross-entropy Sjsing by adding a quadratic term in the interaction parameters, 
which forces x to become definite positive, and ensures the uniqueness and finiteness of the minimum of Si sing- In 
many applications, no regularization is needed for the fields hi. The reason can be understood intuitively as follows. 
Consider a data set where all variables are independent, with small but strictly positive means pi. Then, the empirical 
average products, pu, may vanish if the number B of sampled configurations is not much larger than {pkPi)~^- This 
condition is often violated in practical applications, e.g. the analysis of neurobiological or protein data [1, H, [lol |. 
Hence, poor sampling may produce infinite negative couplings. We therefore add the following regularization term to 

S I sinq ; 



j2 

k<l 



ll^JliPk{l-Pk)pi{l-Pi) ■ (13) 



The precise expression of the regularization term is somewhat arbitrary, and is a matter of convenience. The depen- 
dence on the piS in will be explained in Section llllBI Other regularization schemes, based on the Li norm rather 
than on the L2 norm are possible, such as 

l^\Jki\ VPk{l-Pk)pi{l-pi) ■ (14) 

k<l 

The above regularization is especially popular among the graphical model selection community fisj , and favors sparse 
coupling networks, i.e. with many zero interactions. 

The introduction of a regularization is natural in the context of Bayesian inference. The Gibbs probability Pj [cr] 
defines the likelihood of a configuration cr. The likelihood of a set of B independently drawn configurations cr^ is 
given by the product of the likelihoods of each configuration. The posterior probability of the parameters (fields and 
couplings) J given the configurations cr'^, t = 1,2, . . . , B, is, according to Bayes' rule, 

B 

Ppost[3\W^}] (X n P^^l , (15) 

T = l 

up to an irrelevant J-independent multiplicative factor. In the equation above, Pq is a prior probability over the 
couplings and fields, encoding the knowledge about their values in the absence of any data. Taking the logarithm of 
psp . we obtain, up to an additive J-independent constant, 

l0gPpo,t[J|{'^"}] = -B Sls,ng[J\p] + logPo[J] • (16) 

Hence, the most likely value for the parameters J is the one minimizing >S'/smg[J|p] — logPo[J]. The regularization 
terms (fT3)) and (fT4)) then correspond to, respectively, Gaussian and exponential priors over the parameters. In addition, 
as the prior is independent of the number B of configurations, we expect the strength 7 to scale as -g . The optimal 
value of 7 can be also determined based on Bayesian criteria [l^,!!!] (Appendix IXj) . 

We emphasize that the Bayesian framework changes the scope of the inference. While the MEP aims to reproduce 
the data, the presence of a regularization term leads to a compromise between two different objectives: finding an 
Ising model whose observables (one- and two-point functions) are close to the empirical values and ensuring that 
the interaction parameters J have a large prior probability Pq. In other words, a compromise is sought between the 
faithfulness to the data and the prior knowledge about the solution. The latter is especially important in the case 
of poor sampling (small value of B or data corrupted by noise). For instance, the regularization term based on the 
Li-norm (jl4p generally produces more couplings equal to zero than its L2~norm counterpart (|13p . This property 
is desirable if one a priori knows that the interaction graph is sparse. Hence, the introduction of a regularization 
term can be interpreted as an attempt to approximately solve the inverse Ising problem while fulfilling an important 
constraint about the structure of the solution. We will discuss the nature of the structural constraints corresponding 
to our adaptive cluster algorithm in Section HVCl 

Knowledge of the inverse of the Fisher information matrix, allows for the computation of the statistical 

fluctuations of the inferred flelds and couplings due to a finite number B of sampled configurations. According to 
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the asymptotic theory of inference, the posterior probabihty Ppost[J|{<T'^}] over the fields and couplings becomes, 
as B gets very large, a normal law centered in the minimum of S'/singlJIp]- The covariance matrix of this normal 
law is simply given by ^X"^- Consequently the standard deviations of the fields hi and of the couplings Jki are, 
respectively. 



Sh, = -^^"(^""^ , 6Jki (X ^)ki,ki ■ (17) 

In order to remove the zero modes of x ^md have a well-defined inverse matrix x~^i the Ising model entropy Sising 
([7]) can be added a regularization term, e.g. (fT3|) . which guarantees that % is positively defined. 

The Fisher information matrix, x, can also be used to estimate the statistical deviations of the observables coming 
from the finite sampling. If the data were generated by an Ising model with parameters J, we would expect, again in 
the large B setting, that the frequencies Pi,pki would be normally distributed with a covariance matrix equal to X- 
Hence, the typical uncertainties over the 1- and 2-point frequencies are given by 



. ,1 ,/ (a.)j(l-(^.)j) . /I , {^k<Ji)jil~{^k<Ji)j) 

opi = = Y ^ ' ^Pki = \Ib ^^^'''^ " V B ■ ^ ' 

In practice, we can replace the Gibbs averages above with the empirical averages pi and pu to obtain estimates for 
the expected deviations. These estimates will be used to decide whether the inference procedure is reliable, or leads 
to an overfitting of the data in Section |Vll 

C. Methods 

The inverse Ising problem has been studied in statistics, under the name of graphical model selection, in the machine 
learning community under the name of (inverse) Boltzmann machine learning, and in the statistical physics literature. 
Different methods have been developed, with various applications. Some of the methods are briefly discussed below. 

A direct calculation of the partition function Z[3] generally requires a time growing exponentially with the number 
N of variables, and is not feasible when N exceeds a few tens. Inference procedures therefore tend to avoid the 
computation of Z[3]: 

• A popular algorithm is the Boltzmann learning procedure, where the fields and couplings are iteratively updated 
until the averages (cri)j's and {iTkUi)j^s, calculated from Monte Carlo simulations, match the imposed values 
p^ . The number of updatings can be very large in the absence of a good initial guess for the parameters J. 
Furthermore, for each set of parameters, thermalization may require prohibitive computational efforts for large 
system sizes iV, and problems with more than a few tens of spins can hardly be tackled. Finally, learning data 
exactly leads to overfitting in the case of poor sampling. 

• the Pseudo-Likelihood-hased algorithm by Ravikumar et al. [isl . [l6l | is an extension to the binary variable case 
of Meinshausen and Biihlmann's algorithm [2(| and is related to a renormalisation approach introduced by 
Swendsen The procedure requires the complete knowledge of the configurations {c'^} (and not only of 
the one- and two-point functions p). The starting point is given by well-known Callen's identities for the Ising 
model, 



/ 1 \ 1 'sF- 1 

(cr,:)j 



1 ^ 1 

/ V (19) 

\ 1 + CXp J2.j JijCTj ~ ' J r = ll + exp Jtj'^j - 

where the last approximation consists in replacing the Gibbs average with the empirical average over the sampled 
configurations. Imposing that the Gibbs average (cr,;) j coincides with pi is equivalent to minimizing the following 
pseudo-likelihood over the field hi, 

Si,PL[hi,{Jij, j ^i}]^ ■^'^log 1 + exp Jjjcrj + /li j - hipi - JijPij . (20) 



The minimization equations over the couplings Jjj, with j ^ i (and fixed z), correspond to Callen identities 
for two-point functions. Informally speaking, the pseudo-likelihood approach simplifies the original iV-body 
problem into iV independent 1-body problem, each one in a bath of iV — 1 quenched variables. Note that the 
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couplings Jij and Jji (found by minimizing Sj^pL) will generally not be equal. However, as far as graphical 
model selection is concerned, what matters is whether J^j and Jji are both different from zero. 

The pseudo-entropy Si^pL is convex, and can be minimized after addition of a Li^norm regularization term 
[H m, The procedure is guaranteed to find strong enough couplings [52| in a polynomial time in N ^ 

provided that the data were generated by an Ising model (which is usually not the case in practical applications) 
and that a quantity closely related to the susceptibility x (fTtJ|) is small enough. The latter condition holds for 
weak couplings and may break down for strong couplings [l6|. For a review of the literature in the statistics 
community, see [T3| . 

In specific cases, however, the partition function can be obtained in polynomial time. Two tractable examples are: 

• Mean-field models, which are characterized by dense but weak interactions. An example is the Sherrington- 
Kirkpatrick model where every pair of spins interact through couplings of the order of 7V~^/^ [23] . The entropy 
^[p] coincides asymptotically with 

W(P) = ^ log detM(p), where M,,{p) = ^=M=pi^==, (21) 

which can be calculated in 0{N^) time [HI [13]. Expression (|21l) has been obtained from the high temperature 
expansion |23I425| of the Legendre transform of the free energy , and is consistent with the so-called TAP 
equations [2q|. The derivative of Smp with respect to p gives the value of the couplings and the fields, 

dSMF ^ {M-%1 
9pki ^/Pk{l-Pk)pi{l-Pi) ' 

dSl\IF \r-^ I T \ f P^~ \ \ /on\ 

where Cy — pij — piPj is the connected correlation. From a practical point of view, expression (j21l) is a good 
approximation for solving the inverse Ising problem (28l - [30| on dense and weak interaction networks , but fails 
to reproduce dilute graphs with strong interactions. 

• Ising models on tree-like structures, i.e. with no or few interaction loops. Message passing methods are guar- 
anteed to solve the associated inverse Ising problems. For trees, the partition functions can be calculated in a 
time linear in N . Sparse networks of strong interactions with long-range loops, such as Erdos-Rcnyi random 
graphs, can also be successfully treated in polynomial time by message-passing procedures 5, 31, 32]. However, 
these methods generally break down in the presence of strongly interacting groups (clusters) of spins. 

When an exact calculation of the partition function is out-of-reach, accurate estimates can be obtained through 
cluster expansions. Expansions have a rich history in statistical mechanics, e.g. the virial expansion in the theory 
of liquids [ssl. [3^. However, cluster expansions suffer from several drawbacks. First, in cluster variational methods 
[31I l35j , the calculation of the contributions coming from each cluster generally involves the resolution of non trivial 
and self-consistent equations for the local fields, which seriously limits the maximal size of clusters considered in the 
expansion. Secondly, the composition and the size of the clusters is usually fixed a priori, and does not adapt to the 
specificity of the data [l^l . The combinatorial growth of the number of clusters with their size entails strong limits 
upon the maximal sizes of the network, N , and of the clusters, K . Last of all, cluster expansions generally ignore the 
issue of overfitting. 

Recently, we have proposed a new cluster expansion, where clusters are built recursively, and are selected or 
discarded, according to their contribution to the cross-entropy S This selection procedure allows us to fully 
account for the complex interaction patterns present in experimental systems, while preventing a blow-up of the 
computational time. The purpose of this paper is to illustrate this method and discuss its advantages and limitations. 

III. CLUSTER EXPANSION OF THE CROSS-ENTROPY 

A. Principle of the expansion 

In this Section, we propose a cluster expansion for the entropy S'(p). A cluster, F, is defined here as a non-empty 
subset of (1,2,..., N). To illustrate how the expansion is built we start with the simple cases of systems with a few 
variables {N = 1,2), in the absence of the regularization term (fT5|) . 
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FIG. 2: Decomposition of the cross-entropy ^(p) for a system of 4 spins, indicated with different colors, as the sum of cluster 
contributions. Each cluster-entropy A5'r(p) depends only on the one- and two-point frequencies of the variables in the cluster: 
it can be calculated in a recursive way, see main text. Dotted clusters are decomposed into a diagrammatic expansion in Fig.|3] 



Consider first the case of a single variable, = 1, with average value pi. The entropy S{pi) can be easily computed 
according to the definitions given in Section |IT1 with the result 



Sii)iPi 



min Si sing [hi\pi] 

hi 



mm 

hi 



log (1 H 

= -pilogpi - (1 -pi)log(l -pi) 



hipi 



(23) 



We recognize the well-known expression for the entropy of a 0-1 variable with mean value pi. For reasons which 
will be obvious in the next paragraph, we will hereafter use the notation AS'(i)(pi) to denote the same quantity as 
S'(i)(pi). The subscript (1) refers to the index of the (unique) variable in the system. 

Consider next a system with two variables, with mean values pi,P2 and two-point average pi2. The entropy 
'S'(i,2)(PiiP2,Pi2) can be explicitly computed: 



S{l,2)iPl,P2,Pl2) = min Slsing[hi,h2,Jl2\pi,P2,Pl2] 
hi,h2,Ji2 



mm 

hi,h2,Ji2 



log(l 



„hl+h2 + ,7l2^ 



hipi - h2P2 - J12P12 



-{Pl - P12) log (pi - P12) - {P2 ~ P12) l0g(p2 - P12) 
Pl2l0gPl2 - (1 -Pl -P2 +P12) l0g(l ~P1-P2 +P12) 



(24) 



We now define the entropy AS'(i 2) of the cluster of the two variables 1,2 as the difference between the entropy 
'S'(i,2)(piiP2,Pi2) calculated above and the two single- variable contributions AS'(i)(pi) and A5'(i)(p2) coming from the 
variables 1 and 2 taken separately: 



AS'(l,2)(Pl,P2,Pl2) = 5'(l,2)(Pl,P2,Pl2) - AS'(i)(pi) - AS'(2)(p2) 



(25) 



In other words, AS(^i 2) measures the loss of entropy between the system of two isolated variables, constrained to have 
means equal to, respectively, pi and p2, and the same system when, in addition, the average product of the variables 
is constrained to take value pi2- Using expressions ([25]) and ([M)) . we find 



A5(i^2)(Pl,P2,Pl2) = -(Pl-Pl2)l0g 



Pi ~ P12 
Pi - P1P2 



{P2 -pi2)log 



P2 - P12 



P12 log ( ) - (1 - Pi - P2 + P12) log 

\PlP2 



.P2 
l-Pl 



P1P2 
P2 + PI2 



1 -pi -P2 +P1P2 



(26) 



The entropy of the cluster (1,2) is therefore equal to the Kullback-Leibler divergence between the true distribution 
of probability over the two spins and the one corresponding to two independent spins with averages pi and p2- It 
vanishes for pi2 = PiP2- 

Formula (^5]) can be generalized to define the entropies of clusters with larger sizes TV > 3. Again p = {pi.Pki} 
denotes the data. For any non-empty subset F including 1 < K < N variables, we define two entropies: 
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• the subset-entropy 5'r(p), which is the entropy of the subset of the K variables for fixed data. It is defined as 
the right hand side of when the variable indices, are restricted to V. Note that, when the subset F 
includes all N variables, S'r(p) coincides with S'(p). 

• the cluster-entropy AS'r(p), which is the remaining contribution to the subset-entropy S'r(p), once all other 
cluster-entropies of smaller clusters have been substracted. The cluster entropies are then implicitly defined 
through the identity 

5r(p)- 5] A5r'(p) , (27) 

r'cr 

where the sums runs over all 2^ — 1 non-empty clusters T' of variables in T. 

Identity ([27|) states that the entropy of a system (for fixed data) is equal to the sum of the entropies of all its clusters. 
Figure [5] sketches the cluster decomposition of the entropy for a system of iV = 4 variables. 

For F — (1), equation (P7)) simply expresses that S'(i)(j5i) = AS'(i)(pi). For F = (1, 2), equation (P7)) coincides with 
(p5)) . For F = (1, 2, 3), we obtain the definition of the entropy of a cluster made of a triplet of variables: 

^'5'(l,2,3)(Pl,P2,P3,Pl2,Pl3,P23) = <5'(l,2.3) (Pl , P2 , P3 , Pl2, Pl3 , P23) " AS'(i)(pi) - AS(2){P2) " AS'(3)(p3) 

- AS'(i^2)(Pl,P2,Pl2) - AS'(i^3)(pi,P3,Pl3) - AS'(2^3)(P2,P3,P23) ■ (28) 

The analytical expression of the cluster-entropy AS'(i^2.3) is given in Appendix IbI 
The examples above illustrate three general properties of cluster-entropies: 

• the entropy of the cluster F, AS'p, depends only on the frequencies Pi,Pij of the variables i,j in the cluster F 
(and not on all the data in p). 

• the entropy of a cluster with, say, K variables, can be recursively calculated from the knowledge of the subset- 
entropies Sr' (p) of all the subsets F' e F with K' < K variables. According to Mobius inversion formula, 

A5r(p)- 5](-l)^'-^5r'(p) . (29) 
r'cr 

• the sum of the entropies of all 2^ — 1 clusters of a system of TV spins is the exact entropy of the system, see 
(HZl) with F = (1,2, ...,Ar). 

In practice, to calculate 5(p), one first computes the partition function Z[J] by summing over the 2^ configurations 
cr and, then, minimizes S'/smg[J|p] ([7]) over the interaction parameters J. The minimization of a convex function of 
^K{K + 1) variables can be done in time growing polynomially with K. Moreover the addition of the regularization 
term (fT3|) can be easily handled. The limiting step is therefore the calculation of Z, which can be done exactly for 
clusters with less than, say, K = 20 spins. 

Hence, only a small number of the 2^ — 1 terms in ((27| can be calculated. In the present work we claim that, in a 
wide set of circumstances, a good approximation to the entropy S{p) can be already obtained from the contributions 
of well-chosen clusters of small sizes, 

S{p) ~ ^^r(p) , (30) 
We will explain in Section HVl how the list of selected clusters, L, is established. 

B. The reference entropy So 

So far we have explained how the entropy S{p) can be expanded as a sum of contributions AS'r(p) attached to the 
clusters F. In this Section we present the expansion against a reference entropy, 5*0 (p), and two possible choices for 
the reference entropy. 

The idea underlying the introduction of a reference entropy is the following. Assume one can calculate a (rough) 
approximation S'o(p) to the true entropy S{p). Then, the difference ^(p) — S'o(p) is smaller than iS'(p), and it makes 
sense to expand the former rather than the latter. We expect, indeed, the cluster-entropies to be smaller when the 
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reference entropy 5*0 (p) is substracted from the true entropy. We substitute the original definition ([27)1 with the new 
definition 

S{p) = 5o(p) + ^^r(p) . (31) 

rc(l,2,...,Af) 

With this new definition, the values of the cluster-entropies A^r depend on the choice of Sq] the previous definition 
(P7)l is found back when 5*0 = 0. The procedure for the calculation of the cluster-entropies AS'r(p) is the same as 
in Section fill A| upon replacement of S{p) with S'(p) — S'o(p)- The three properties of the cluster expansion listed 
above still hold. 

Our final estimate for the entropy will be, compare to (1301) . 



^(p)~5o(p) + ^A5r(p) . (32) 

reL 

Hence, the cluster expansion is a way to calculate a correction to the approximation Sq to the true entropy S. 
Obviously, the introduction of a reference entropy is useful in practice only if S'o(p) can be quickly calculated for the 
entire system of size N. In other words, the computational effort required to obtain Sq should scale only polynomially 
with N. A natural choice for the reference entropy is Sq = Smf (PT|) - the mean-field entropy discussed in Section 
III CI As the calculation of Smf requires the one of the determinant of the matrix M(p), it can be performed in a 
time scaling as only. In addition, we expect Smf to be a sensible approximation to S for systems with rather 
weak interactions. Corrections coming from the strongest interactions will be taken care of by the cluster expansion. 

Regularized versions of the Mean Field entropy can be derived as follows. First, we use the MF expression for the 
cross-entropy at fixed couplings Jki and frequencies pi, see ([7]) and [24], to rewrite 

'S'/si„g({pJ,{Jfe/}) = -^logdet(ld- J') Jfe/ (pfe/ -pfeP/) , where J^; = Jki\/pk{^ - pk)pi{l - pi) , (33) 

k<l 

and Id denotes the A^-dimensional identity matrix. We consider the L2-iiorm regularization (|13p. The entropy at 
fixed data p is 



5o(p) = min 

{Jkl] 



mm 



Slstng{{Pi\AJM}) +lYjklPk{i -Pk)Pl{l-Pl) 



k<l 



i logdet(ld - J') - iTrace(j' • M(p)) 4- |Trace(j')^ 



(34) 



where M(p) is defined in (|2ip . The optimal interaction matrix J' is the root of the equation 

(Id- J')"^ - A/(p) +7 J' = . (35) 

Hence, J' has the same eigenvectors as M(p), a consequence of the dependence onp,; we have chosen for the quadratic 
regularization term in ((T^ . Let jq denote its q*'* eigenvalue, and rhq = (1 — jg)^^. Then, 

1 ^ 

So{pn) ^ ^Yi^ogrhq + l-rhq) , (36) 

9=1 

where rhq is the largest root of rn?q — rhq{mq — 7) = 7, and is the g*'' eigenvalue of Af (p). Note that rhq — rriq 
when 7 = 0, as expected. 

C. Properties of the cluster entropies ASr 

1. Diagrammatic expansion m powers of the connected correlations 

A better understanding of the cluster expansion and of the role of the reference entropy Sq can be gained through the 
diagrammatic expansion of the entropy S{p) in powers of the connected correlations (high-temperature expansion). 



Cij — Pij PiPj 



(37) 
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FIG. 3: Diagrammatic expansion of the cross-entropy S{p). A cluster-entropy (see Fig.[2J is the infinite sum of aU the diagrams 
in a box (dashed contour), linking the K sites in the cluster. Each link in a diagram carries Mij, and each site pi; in addition, 
each diagram carries a multiplicative factor, which is a function of the Pi's. In the Figure only one cluster among all (^) 
clusters is represented. Only the first diagrams with non-zero coefficients are drawn. Loop diagrams are analytically summed 
up and removed from the expansion through the reference entropy So = Smf', Eulerian circuit diagrams (brown/gray) are 
partly removed, see main text. Diagrams giving the largest contributions to the universal central peak of the cluster-entropy 
distribution (Appendix[C]) are shown in bold. 



Note that the entry Mij of the matrix M defined in (j21[) vanishes linearly with Cy. Thus, an expansion in powers of Cij 
is equivalent to an expansion in powers of . A procedure to derive in a systematic way the diagrammatic expansion 
of S'(p) is proposed in [s^]. The diagrammatic expansion provides a simple representation of the cluster-entropies, in 
which the entropy S{p) can be represented as a sum of connected diagrams (Fig. [3]). Each diagram is made of sites, 
connected or not by one or more edges. Each point symbolizes a variable, and carries a factor pi. The presence of 
n{> 0) edges between the sites k and I results in a multiplicative factor {cki)"'- The contribution of a diagram to the 
entropy is the product of the previous factors, times a function of the pi specific to the topology of the diagram, see 
[30j . Diagrams of interest include (Fig. [S]): 

• the N single-point diagrams, whose contributions are AiS'(i)(pi); 

• the 'loop' diagrams, which consist of a circuit with K edges going through K sites ii —> 12 — >...—>■ i/f ii, 
whose contributions to the entropy are 

Sioop{p\ii,i2,---,iK) = (-1)^-^M,,,,,M,,,,3 ...M,^_,,,^M,^,,, ; (38) 

• the Eulerian circuit diagrams, for which there exists a closed path visiting each edge exactly once; 

• the non-Eulcrian diagrams, with the lowest number of links (smallest power in M). 

The entropy for two variables S{pi,pj,pij) ([M)) . is the sum of the two single-point diagrams i and j, plus the 
sum of all connected diagrams made of the two sites i and j with an arbirtrary large number of edges {n > 2) in 
between (first two columns in Fig. [3]). According to (I25p . the cluster-entropy \S(ij){pi,pj,pij) is equal to the latter 
sum (second column in Fig. [Sj . More generally, the entropy of a cluster A^r (p) is the infinite sum of all diagrams 
whose sites are the indices in F. 

We now interpret the Mean Field expression for the entropy, Sm f , in the diagrammatic framework. We start from 
identity ((2T|) . and rewrite, 

, , 1 1 r / NT -^r^ -Trace|"(ld-M)^l , , 

Smf{p) = ^Trace log M = -Trace log [id - (id - M)] = — ■ (39) 

K>1 
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L=4 L=4 L=6 




FIG. 4: Examples of contour paths for three different graphs. Spins are labelled by 1, 2, 3, 4 and first-neighbor interactions are 
represented by bold lines. The contour path is depicted with a dotted line. The contour length L, which can be calculated as 
the sum of distances along the contour path (dotted arcs have length 2) is indicated above each graph. Different clusters may 
have the same contour path and contour length. Left: (1,3) and (1,2,3) have contour length L — 4, while (1,2) has L = 2. 
Middle: clusters (1,3), (1,2,3), (1,2,3,4), (1,3,4) have the same contour path. Right: (1,2,3,4) has length L — 6. 



Using the fact that the diagonal elements of M are equal to unity, the term corresponding to K = 1 above vanishes. 
For K > 2, we have 

-Trace[(M-Id)'^] - - ^ {K,,, - M,,^,,){S,,,,^ - M,,^,,) . . . {6^^_,^,^ - M,^_,^,^){6^^^,, - M,^^,,) 

il,i2,---,iK 

= i~'^f~'M^„^,m,,^,■■.M^^^„^^m^,^,, (40) 

ii.i2,....iK 

where the matrix M has the same off-diagonal elements as M, and has zero diagonal elements. Each term in the 
above sum corresponds to an Eulerian circuit over K' < K sites, where K' is the number of distinct indices in 
(ii, 12, • . • , Ik)- Note that the same circuit can be obtained from different if-uplets of indices. Consider for instance 
the longest circuits, obtained for K' = K, i.e. all distinct indices. 2K different i^T-uplets («i,«2, ■ • ■ , ^k) correspond to 
the same circuit, as neither the starting site nor the orientation of the loop matter. For instance, «i — )■ «2 *3 *i, 
*2 ^ *3 ~> «i ~> *2, *i — ^ *3 — >■ ^2 ^ ii, ■■■ are all equivalent. This multiplicity factor 2K precisely cancels the 2K at 
the denominator in p9p. The contribution corresponding to a circuit therefore coincides with expression psp for the 
loop entropy. We conclude that 

• SMpip) sums up all loop diagrams exactly; 

• 'S'mf(p), in addition, sums up Eulerian circuit diagrams, but with weights a priori different from their values in 
the cross-entropy S{p) [53[. An exception is the three- variable Eulerian diagram shown in Fig. [31 whose weights 
in Smf and S coincide. 

• no non-Eulcrian diagram is taken into account in Smf{p)- 

As a conclusion, the diagrammatic expansion provides a natural justification for the choice of the reference entropy 
'S'o(p) = SMpip)- In addition, it provides us with the dominant contribution to the cluster-entropies once the Mean- 
Field entropy is substracted, see Fig. [3] A detailed study of those dominant contributions is presented in Appendix [Cl 

2. Dependence on the cluster size and on the interaction path length 

To reach a better understanding of what the cluster-entropy means, we consider the case of finite-dimensional Ising 
model, e.g. with coupling J > between nearest-neighbors on a Z?-dimensional lattice. We call ^ the correlation 
length: the connected correlation c between two sites at large distance d decays as ^ exp(— c?/^). We want to 
characterize the behavior of the cluster-entropy AS'r when the K sites in the cluster T are far apart on the lattice. We 
first choose no reference entropy (5*0 = 0). According to the above diagrammatic expansion, the lowest order diagram 
(in powers of c) with K sites has the loop topology. We look for the shortest closed path joining all the sites in T; 
let L{T) be this contour length, that is, the sum of the distances between neighboring sites along the path (Fig. |4|). 
Then, according to the largest contribution (in absolute value) to the cluster entropy is 

A5r ^ A{p, K) (-1)^-1 exp (- L{T)/^) , (41) 

where A is a positive function K and of p, the representative value of the frequencies pi of the variables in T. We 
conclude that the sign of the cluster-entropy depends on the parity of the number of sites. Furthermore, A^r decreases 
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exponentially fast (in absolute value) with the length of the shortest path joining the sites in the cluster. As soon as 
one site is very far away from the remaining K — I ones, the cluster-entropy is small. 

As a consequence, the sum ([ST)) is alternate, and we expect cancellation between contributions coming from clusters 
sharing the same shortest path, but with different sizes. This crucial point is perfectly illustrated by the one- 
dimensional Ising model. The correlation between two sites at distance dij = j — i is, in one dimension, Cij = 
y^Pi(r^^Pi)pJ(T^^Pjyexp(— dij/^) (Appendix|F|. The matrix M defined in (pTt has elements 

My = e"'''^/^ (42) 

Then, according to ([551) , the largest contribution (in absolute value) to the cluster entropy of a cluster containing the 
K spins ii < 12 < . . . < j/f is given by (PT|) with 

L(r = (n,Z2,...,Zfc)) ^2{ik~ii) , (43) 
and A{p,K) — i. An exact calculation, reported in Appendix [Fl shows that 

A%,,....,,) = (-l)^^^(exp(-^^)) , (44) 

where F is a smooth function given in (jFll[) . such that F{0) = F'{0) = 0,F"(0) = —1. This identity is in agreement 
with (HT|) . since the shortest path encircling all sites has length L{T) = 2{iK — ii)- Hence, all clusters sharing the same 
'extremities, i.e. the same values of ii and iK, have the same entropies in absolute value. The sign is determined 
by the parity of K as mentioned above. Let ik — ii = d. T = (ii,i_R-) is the unique cluster of size K = 2 having its 
'extremities' equal to «i and iK', its entropy is AS"^*^ -^^ — F{exp{—d /£,)). There is (d— 1) clusters of size K = 3 with 

the same extremities, each having an entropy equal to — A5^* ^ More generally, there are (^"2) clusters of size K 
with the same extremities, each having an entropy equal to (— 1)^'^^AS'^* ^ The total contribution to the entropy 
of all those clusters (at fixed extremities ii,ifc) is 

A%xed - E(-l)"-^ (k' 2) ^^^^-) = - ^^^^-) - I ^^1'^^^ IdZl' (4^) 

K=2 ^ ^ ^ - 

The above calculation nicely exemplifies the cancellation of cluster-entropies. The contributions of all clusters sharing 
the same extremities exactly compensate each other, unless those extremities are nearest-neighbors on the lattice. We 
show in Appendix [F] that this exact cancellation is a consequence of the existence of a unique interaction path along 
the unidimensional chain. As a result, in dimension D = 1, the cross-entropy S is simply the sum of the entropies of 
the clusters made of nearest neighbours. 

In the presence of a reference entropy, Sq — S]\if, the asymptotic scaling of the cluster-entropy with its contour 
length L changes, as the dominant contribution coming from loop diagrams is removed from the cluster expansion 
and absorbed into Sq. The subleading contribution to the cluster-entropies is depicted in bold in Fig. |3] and derived 
in Appendix [Cl In dimension D = 1, formula (I4ip is replaced with 

A%,,,,...,,;,.) = A'(p,i^) (-1)^-1 exp ( - 3(ifc - *i)/C) . (46) 

Note the sharper asymptotics decay with the distance between the extremities of F than in the absence of reference 
entropy. As expected, the terms in the expansion of 5' — are smaller than the one in the expansion of S alone. 
Remarkably, the exact cancellation property studied above also holds when the reference entropy is non-zero, as 
proven in Appendix IF] 

In dimension £) = 2 or higher, more than one interaction path connect any two spins, and cluster-entropies with the 
same contour path do not cancel exactly as in the D = 1 case. However, partial cancellations are present. Figure [S] 
shows the values of the cluster-entropies versus the length of the shortest path, L{r), for a small bidimensional 3x3 
grid. For such a small system all data p and cluster-entropies ASr (with up to K = 9 spins) can be calculated by 
exact enumeration methods. We observe that: 

• |A5r| is sensitive to the value of Lp more than to the size K of the cluster; 

• lA^rl rapidly decreases with Lr] 

• the values of the cluster-entropies reflect the structural properties of the lattice, e.g. clusters made of central 
sites, such as 4-5, have a larger entropy than the clusters including pairs of edge spins, such as 1-2; 
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FIG. 5: Cluster entropy contribution ASr for a 3 x 3 grid (top-right) with nearest-neighbour couplings J — 1.777 (in units 
of fcflT) as a function of the contour length Lr of the shortest closed path on the lattice joining the spins in F. To illustrate 
cancellation effects, some ASr are labelled with the indices in F, see main text. The values of J and of the fields hi = 
^ \ Jij SH ''■re chosen to make the system critical in the infinite grid size limit, see Section TVI CI 



• the sign of AS'r changes with the parity of the size of the cluster. 

As a result, the contributions to the entropies coming from the clusters sharing the same path, of length i, partially 
cancel each other. Consider for example the path 1-2-4-5 of length L = 4; all 7 clusters that share this path have 
similar lAS'j, ranging between 0.0024 and 0.0046, and so does their sum, A5'(2,4) -I- AS'(i^5) + ^'S'(2,4,5) + AS'(i^4 5) -|- 
^'5'(i,2,5) + ^<S'(i_2,4) + AS'(i_2,4,5) = —0.00242 [5J]. The sum of the entropies of the clusters sharing the same path is 
generally of the same order of magnitude as, or even smaller than the single contributions. Figure [5] shows that the 
sum of the 12 clusters of contour length L = 2 and of the 4 square-path contributions ( lAS'j > .0024) approximates 
the entropy within 10~^. 



IV. TRUNCATION OF THE CLUSTER EXPANSION 



In this Section we present a truncation scheme for the cluster expansion, which consists in discarding all clusters 
with entropies smaller than a threshold Q. We explain why this scheme is efficient, in particular in the presence of 
sampling noise, and robust against strong correlations in the data (large correlation length). The behavior of the 
expansion as a function of the threshold is discussed. 



A. Schemes for truncating the expansion in the noiseless case 

Expansion ([77|) for S'(p) includes 2^ — 1 terms, and is useless unless an accurate truncation scheme is available. A 
naive truncation consists in keeping the contributions from the clusters with < K spins, where K is an arbitrary size. 
This procedure was applied to neurobiological data (with N < ^0, K = 7) in [l^, which are characterized by large 
negative fields. However it suffers from two drawbacks. First, the combinatorial growth of the number of clusters 
with N and K impedes its application to very large systems. Secondly, the truncation does not converge properly 
with increasing K if the correlation length of the system is large. 

As an illustration, consider again the ID-ferromagnetic Ising model, with correlation length ^. The sign of ASr 
alternates with the parity of the size K of F; its modulus decays asymptotically as exp(— fid/^), where d is the 
maximal distance between any two spins in F (Section IIII C 2[) . and 17 = 2 if there is no reference entropy (Sq = 0), 
ri = 3 if So = Smf- Let AS(i^) be the sum of ASr over all the clusters F with K spins. In the thermodynamic limit 
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{N oo), 

Consider then the series summing all ■^AS{K) with ii' > 2. The series is convergent if ^ < = j^^i ^^'^ divergent 
when ^ > ^c- In the latter case, for a finite-TV system, the maximum of |AS'(i^)| is exponentially large in TV, and 
is reached in if ~ ^. As a consequence, for £, > ^c, the sum ([?7)) can not be truncated according to the size of the 
clusters. This result is not specific to the dimension unity, and holds for other interaction networks. The expansion 
of S'(p) defines an alternate series, and the order of its terms matters for its convergence in the TV — > cx) limit. For 
an Ising model on a generic lattice with fixed degree (number of neighbours) v, the largest value of ^ such that the 
series P7)) (after division by TV) is absolutely convergent in the TV — oo limit is = f Appendix [P]) . 

A better truncation scheme consists in keeping cluster-entropies larger than a threshold Q only. Let us define 

5(p,e)= ^^r(p). (48) 

rc(i,2,...,Ar) 

|ASr(p)|>e 

The rationale is that, due to the properties of the cluster entropies and to the cancellation mechanism exposed in 
Section [ill C 21 summing large cluster-entropies may provide a good approximation to the true value of >5'(p). In the 
D = 1 Ising model case, the exact value of S{p) is, indeed, obtained as soon as 8 < AS'(i 2)- We show in Fig. [B] the 
residual error in the cross-entropy due to the truncation as a function of the threshold Q for the same small D = 2 
grid as in Fig. [5l The error S{p, Q) — S{p) is very small, and equal to 10~^ when all clusters with contour length 
smaller than 4 are taken into account. As Q is made smaller, clusters with larger contour lengths are summed up, 
and the error reaches the numerical accuracy ^ 10^^^. On top of this trend, positive fluctuations, corresponding to 
larger errors, arise when not all the clusters with the same interaction path (and length L) are summed up, and the 
cancellation of those contributions is not effective (Fig. [Bland caption). We will study in more details this phenomenon 
in Section [IVDI 

We now explain why the presence of noise in the data provides a compelling argument supporting the introduction 
of the cut-off 6. 



B. Distribution of small cluster-entropies in the presence of noisy data 

In this Section, we investigate how limited sampling affects the values of the cluster-entropies. We assume that B 
configurations cr'^ are sampled from the Gibbs distribution of an Ising model with interaction parameters J using a 
Monte Carlo procedure to generate the data p. 



1. Universality at small \ AS\: numerical evidence 



The empirical correlations, = pij —piPj, differ from the Gibbs correlations, {(Ji<Jj)3 — {o'i)3{o'j)j, by random 
fluctuations of amplitude 



where p is the typical value of the pi. For pairs i,j with weak Gibbs correlations (< cb in absolute value), the 
experimental correlations are dominated by the noise. As a consequence, the distribution of the cluster-entropies is 
universal for small AS. Its structure is a consequence of the noise in the data, and not of the interaction network of 
the model used to generate the data. 

Figure [71 shows the histograms H (full distributions) of the entropies AS'^j j- j.) for the K = 3-clusters for a one- 
dimensional Ising model, for three values of the numbers B of sampled configurations. The histograms are made of 
two components: a bell-shaped distribution at small |AS'|, and isolated peaks at larger |AS'|. The cluster-entropies 
corresponding to the isolated peaks have the same values as in the perfect sampling case {B — oo, impulses). When 
B increases, the bell shapes move towards smaller entropies (in the log-scale of Fig. [7]), and more peaks are unveiled 
in H. 

We show also in Fig. [Tithe histograms Hjg for a system of Independent Spins (IS), with the same pi's as the 
original system, and the same number B of sampled configurations. Contrary to H, His does not exhibit isolated 
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FIG. 6: Effect of truncation on the bidimensional 3x3 grid with J = 1.777 (units of fcsT) and fields hi = — X^jx^^i) "^^i [49| . 
for these parameters values the model has a phase transition between the paramagnetic and ferromagnetic phase and therefore 
the correlation length is proportional to the linear size of the system. Top: difference between the truncated and the true 
cross-entropies as a function of the cut-off on the absolute cluster-entropies, Q. Bottom: contour lengths L(r) vs. O. The 
fluctuations of 5(0) — S reflect the cancellation phenomenon. Summation of the 12 clusters of nearest-neighbours with K = 2 
and L = 2 gives S{0 — 0.1) — S ~ 0.01, of the 21 clusters contributions corresponding to squared paths, 6.17. 1-2-4-5), with 
K = 2,3,4: and L = A gives S(e = 0.002) - 5 ~ 10"''. Fluctuations arise if only a part of the clusters that share the same 
interaction path are summed up, and cancellation is incomplete. For instance, fixing = 0.0025 discards (1, 2, 4, 5), which has 
the same interaction path as (2,4). 



peaks at well-defined, _B-independent cluster-entropies. The histograms Hjs concentrate around smaller \AS\ as the 
number B of configurations increases. Note that the histograms Hjs roughly correspond to the bell-shape parts of 
the distributions H for the same value of B. We have checked that these features are largely independent of the 
particular sample and of the cluster size, K. 

The histograms Hjs depend on B through their standard deviation, ais{B). The calculation of (Jis{B) from the 
dominant contribution (IC7p in the diagrammatic expansion of the cluster entropies (Section IIII C ip is presented in 
Appendix lEl We obtain that, for clusters of size K and in the case of uniform averages pi = p different from 0, i, 
and 1 m. 

Figure [5] shows how the small-entropy regions of the histograms H obtained for different B collapse onto each other 
once rescaled by ais{B). As expected, the rescaled H coincide with His in the small |AS'| < (Tjs{B) region, which 
concentrates most of the distribution (Fig. [5]). The universality of the distribution at small AS is not specific to 
the one-dimensional Ising model, but holds, in the thermodynamic limit, for all interacting spin systems when the 
measured connected correlations are corrupted by noise. For a finite system in dimension D with correlation length 
we expect that the small-AS* is universal when N > £^ , where i — ^ log(l/cB). Indeed, the number of large Cij 
coming out of the noisy background is fa N , while, for most of the (^) pairs of spins the connected correlations 
have random values of amplitude cb- 
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FIG. 7: Histograms of AS(ij^k) for the ID-Ising (full distributions, p = .02, C — 1) ^^^id Independent Spin (dotted distribution) 
models, with = 50 spins. Each histogram correspond to one random sample of B configurations. Impulses show the histogram 
for perfect sampling {B = oo), with Dirac peaks located at log jASr] 
the finite value of A'^. 



+ Cst. The cut-off at small entropies comes from 



The full distribution Hjs can be characterized analytically in the iV — >■ oo limit. Details can be found in Appendix 
IeI We find the following scalings, depending on the value of the cluster size, K: 



HisiAS) 



1 



K-2 



for K 



|A^|2/3 

exp(-C2(B) lAS-l 



(-logA^) 

2/(2if-l)(l + o(l)) 



for > 3 (|A5| 
for every K >2 



>0) , 

(large |A^|) 



(51) 



where C2(B) = 2 x S^^-i^/t^^-i' k'-^k^3}/{2K~i} /^2K-1)^ (c^as)"^^^^-^"^^ is proportional to B, see Appendix Hand 
equation (jE2ip . The distribution is therefore characterized by a divergence at the origin, and stretched exponential 
tails. The scalings above were derived with the choice Sq = Smf', in the absence of the reference entropy, the stretched 
exponential has exponent instead of 



2K-1 ■ 



2. Fmite-N effects and lower bound to the threshold O 

The discussion about the localized peaks and the bell-shape distribution in Hjs in the previous Section is an 
oversimplification. In reality, for finite systems, large fluctuations of the sampled correlations take place, and no clear- 
cut boundary exist between cluster-entropies due to the noise and the ones deriving from the interaction network. 
From extreme value theory [SS^ , the largest value of the correlations are of the order of 

cfJAX ^ Ci3%/4loglV. Therefore, 

the largest cluster-entropy is, according to (|C7|) . of the order of 



^grnax _ (Uog N)^^^ -^'> ffj s . (52) 

A more detailed calculation to estimate where this fuzzy boundary between the signal and the noise in the entropy 
distribution takes place is presented below. Let AIk{Q) be the average number of clusters of size K with entropies 
IAS"! > e. According to ([SI]), 

for large Q and N (compared to K). The value of the threshold 9 such that M{Q) — N"", with a < K, is, to the 
leading order in N, 

K — a 



&ia)^[-^-^\ogN] . (54) 
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AS/ajg(B) 

FIG. 8: Same as Fig. [T] after rescaling by the standard deviation (t/s (B) ([50]) of the Independent Spin model. Note the 
linear scale of the x-axis. As a result of the presence of the interaction network, the Ising histograms H are asymmetric in 
AS — >■ —AS for large values of AS, while the IS distributions His are obviously symmetric when averaged over the realizations 
of B configurations (not shown). 




FIG. 9: (a) Number M of clusters {i,j,k) with |AS(ij,fc| > S as a function of the threshold Q, for one realization of _B = 10 
configurations (A'^ = 40 independent spins with p = .0248). The theoretical values for the threshold, 0{2) — 51Q~^'',0(1) = 
1.3 lO"'^^, O{0) = 5.5 10~^^, corresponding to, respectively, M = , N, 1, are shown, (b) Number of clusters as a function of 
their entropy AS. Same data as in (a), on a smaller entropy scale. 



In particular, using the formula above for a = 0, it is likely that no cluster have entropy larger than 6(0), in agreement 
with ((5^ . 

We have tested formula (|54l) through a computation based of a system of iV = 40 Independent Spins, with uniform 
mean p = .0248; these parameters were chosen to mimick real data described in [2]. Figure [9ja) shows the number 
of clusters with entropies larger than Q in absolute value, for a random set oi B = 10^ configurations {K = 3). The 
theoretical predictions based on (j54p are in very good agreement with the simulations. The vast majority of clusters 
have entropies smaller than, say, 8(2). On a smaller entropy scale, the histogram Hjs of the small cluster entropies 
is strongly concentratred around zero as predicted in [51] (Fig. Iljb)). 

As a conclusion, due to the sampling noise, most small cluster-entropies are random quantities, and provide no 
information about the underlying interactions parameters. Imposing a threshold G allows one to remove these artifact 
contributions. A lower bound to the value of is given by (1541) . with, say a = 1 or 2. In practice, we will see that 
higher values of Q may be sufficient for an accurate solution of the inverse Ising problem. 
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C. Properties of the susceptibility matrix and of its inverse 



We now present a theoretical argument suggesting that the truncation scheme we have introduced is robust against 
an increase of the correlation length of the system. More precisely, the maximal size of the clusters to be summed 
up to reach an accurate solution of the inverse problem is not directly related to the correlation length, but rather 
depends on the structure of the interaction graph. 

The susceptibility matrix x characterizes how the observables of the Ising model, such as the averages and 
correlations in p, are modified in response to an infinitesimal change in one or more interaction parameters in J. As 
far as the inverse Ising problem is concerned, it is more natural to ask the following question. Assume the inverse 
problem has been solved for a set of data p and the corresponding interations J have been found. Now imagine that 
the data are slightly changed, p — > p + ^P- How large will be the resulting change S3 in the interactions? The 
response function characterizing the inverse problem, 

is simply the inverse of the susceptibility matrix x- Whether the inverse problem is well-behaved or not will therefore 
depend on the properties of X~^- In particular, it will depend on the largest eigenvalues of x~^ a-nd on the structure 
of the corresponding eigenvectors. 

A quantity which is closely related to ((55|) in liquid theory is the Ornstein-Zernike direct correlation function. The 
direct correlation is widely believed to be short-ranged, as the interaction potenti al J46l| . This property is used in 
closure schemes such as the Percus-Yevick scheme to obtain the equation of state [43] ■ We discuss below in details 
the property of the inverse susceptibility matrix in the case of the spherical model and of the unidimensional Ising 
model. 



1. Case of perfect sampling: properties of x O'^d, x ^ 

Consider first the 0{m) model, where each site i = 1,2,..., TV carries a m-dimensional real- valued spin vector 
(Ti — {al, (t|, . . . , cr™), of norm ^/rrl. As usual, two spins, say, i and j, are coupled through the dot product of their 
spin vectors, —Jij cTi ■ cTi (units of kgT). Hence the interaction J^j couples the same component (a) of the spins in 
the pair The fields hf, with a = 1, 2, . . . , m, are chosen to vanish for simplicity. In the large-m limit the model 
can be exactly solved [40] . The cross-entropy is equal to 

S{p) = - log det p + O ( log m) , (56) 

where p is the N x N matrix with diagonal elements "pa = 1 and off-diagonal elements pij , equal to the average of 
the product of the components a of spins i and j. The elements of the inverse susceptibility matrix are obtained by 
differentiating S{p) twice with respect to p, 

i^~^)kl,k'l' = \ {Jk,k'JlM + JkMJl,k') ■ (57) 

Hence, the inverse susceptibility has the same structure as the interaction graph. In particular, if the coupling matrix 
J is sparse (has many zero elements), so is x~^ ■ On the contrary, the susceptibility matrix x is generally not sparse. 

The observation above is not specific to spherical spins. Consider now the I?-dimensional Ising model with cr^ = 0, 1 
spins on a hypercubic lattice, with nearest neighbour interactions Jij. In the D = 1 case the susceptibility matrix 
(top right corner in matrix (jlOp ) is non zero for all Xi,i c< ', where the proportionality constant does not 
depend on and x = exp(— 1/^). The inverse susceptibility matrix is a tridiagonal matrix [43]: the only non-zero 
elements are 

1 -\- x'^ X 
1 — 1 — a;^ 

As in the spherical model, the structure of the inverse susceptibility matrix is the same as the one of the interaction 
matrix. In dimension D > 2 the inverse susceptibility matrix is not, strictly speaking, sparse. However it exhibits 
a much faster decay with the distance r = \i — i'\ than the susceptibility itself [5g]. At the critical point, the latter 
decays as x(r) ^ 7--(^-2+»?)^ where the critical exponent attached to the decay of the spin-spin correlation, 77, vanishes 
in dimension -D > 4, and is positive and small in dimension D < 3, i.e. V = j for D = 2. The inverse susceptibility is 
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the Laplacian in dimension D > 4, a purely local operator, and decays as X~^('^) ~ r~^^~^'^~^^ for D < 3. While both 
quantities decrease as power laws in r, the inverse susceptibility has a much sharper decay than the susceptibility 
itself. In particular, the integrated contribution to the susceptibility coming from distances larger than R, 



i-n 



(fr x(r) = R 



diverges when i? — )• oo, while the same quantity calculated for the inverse susceptibility. 



1 



(59) 



(60) 



tends to zero as i? — >■ oo. This fact is a good news for the inverse problem. According to ([55]) the error on the field 
hi done when discarding all the spins at distance R > e~i/(3-'/) jg of the order of e only. In this regard, the inverse 
Ising problem remains local even at the critical point. 

While the discussion above is related to the response of a field hi to a change in the average pi' of spin i', the response 
of a coupling Jki following a modification of the 2-point average Pk'u, see ^TO\i . is also of interest. Unfortunately, to 
our best knowledge, this quantity has not been studied in the case of the Ising model so far. As a first step, we 
focus here on the D — 1-Ising model with uniform nearest-neighbour interactions, and in the thermodynamical limit 
{N — > oo). The four-spin connected correlation function is, up to a p-dependent multiplicative constant, equal to 



XijM 



„ii—i3+i2—il 



(61) 



where ii < 12 < *3 < ^4 are the same indices as i,j,k,l but sorted in increasing order, and x = exp(— 1/^) < 1. We 
show in Appendix [G] that the inverse susceptibility matrix is given by 



ij,kl 



= < 



(l-i;2)2 

{i-x'-'y-' 




if 
if 
if 
if 

otherwise 



i — k, j ^ I and j > i + 2 , 
i = k, j = I and j = i + 1 , 
i = ± 1, = Z or i = fc, J = / ± 1 , 
i = /c± l,j = / ± 1 , 



(62) 



Hence, the inverse susceptibility matrix is sparse, with at most 9 non-zero elements per line, while the dimension of 
the matrix is ^N{N — 1) — > 00. In dimension D > 2, we do not expect to be sparse. However we conjecture 
that (x~^),jj ki decays quickly with the minimal distance between the four points i, j, fc, I (each index, e.g. j, is now a 
ZJ-dimensional vector). 



2. Influence of the sampling noise on the norms of x '"^'^ X 



To corroborate this statement we have carried out exact numerical analysis of small bidimensional grids (Section 
IIII C 2p . We show in Fig. [TUT a) the fraction of elements Xij,ki of the susceptibility matrix larger than e = 10"'' in 
absolute value (the largest elements have magnitude ~ 1). This fraction is closed to 1 for all the values J of the 
coupling we have studied. As expected, the inverse susceptibility matrix has many more small elements fFig. llOf b)). 
In addition, the fraction of entries in x~^ smaller than e seem to increase with the size N of the grid. 

In the presence of noise in the sampling process the inverse matrix x^^ loses its quasi-sparse structure. More 
precisely, for the number B of sampled configurations chosen in Fig. fTOFb). all the elements {x~^)ij,ki are larger than 
e in absolute value. Indeed, the quasi-sparsity x~^ in the perfect sampling case refiects the sparse structure of the 
underlying interaction matrix. When data are corrupted by noise, the Ising model (over)fitting the data has no reason 
to be sparse anymore, and neither has the inverse susceptibility. 

The infiuence of the sampling noise on the susceptibility matrix and on its inverse can be measured through the 
largest and smallest eigenvalue of x, denoted by, respectively, X^ax and Xmin- According to Figs. [TlT a.b). we have 
that: 

• Xmax increases with the size of the system (we expect Xmax to diverge at the critical coupling J ~ 1.778 in the 
thermodynamical limit), but is not affected by the sampling noise (the black and red/gray curves associated to 
the same size are nearly indistinguishable in Fig. Illf a)). 

• Xmin is not strongly affected by the system size in the case of perfect sampling. In case of noisy sampling, X„iin 
acquires a smaller value. The effect of the noise increases with the system size (Fig. [TlT b)). 
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FIG. 10: Fraction of elements larger than 10~^ (in absolute value) for the susceptibility x f^nd the inverse susceptibility 
(b) matrices vs. strength J of the nearest-neighbour coupling. The sizes A'^ of the grids are indicated. Data were obtained 
from exact numerations for sizes 3x3, 4x4, 5x5 (perfect sampling, black) and from Monte Carlo simulations for all sizes 
(one realization of B = 4500 configurations, red/gray). Periodic boundary conditions were used. 
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FIG. 11: Largest (a) and smallest (b) eigenvalues of the susceptibility matrix, and norms of x (c) and of its inverse x~^ (d) 
as functions of the coupling strength J for the 3x3 grid. See Fig. [10] for explanations regarding the color code and the line 
styles. 



Those facts are observed from the study of the norms of the two matrices x ^-nd x ^ ■ Here, we define the norm of 
the matrix A through 

||^||=max5]|A,,| . (63) 

j 

Figures [TlT c.d) show that the behaviours of the norms ||x|| and are, from a qualitative point of view, similar to 

the ones of, respectively, Xmax and 1/ Xmin- However the norms are directly related to the magnitudes of the elements 
of the matrices, according to The independence of from the size N, contrary to the strong increase of 

11x11, supports the notion that most elements of x~^ are very small (or even zero) in the case of perfect sampling. 
This property is lost when the sampling is not perfect: the presence of noise in the correlation makes the norm 
increases with A'' (Fig. fTlTd)). 
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FIG. 12: (a) Sum E of the cluster-entropies larger than O (in absolute value) for the nearest neighbour one-dimensional 
Ising model with ^ ~ 8.97 ^ — 4.33, and B = lO'' configurations. The initial increase from zero, taking place at small 
~ A5'(l) ~ 0.41, is not shown, (b) magnification of (a) in the range .029 < G < .033. Within the random sign model, the 
behavior of E within a packet is similar to a Brownian bridge, see main text. 



D. Dependence of the truncated entropy on the threshold 

Hereafter, we study how the error on the entropy S{Q) resulting from the truncation varies with the threshold 
and we discuss the fluctuations of S{Q) — S observed in Fig. El We start by sorting the absolute values of the 
cluster-entropies |A5'r| in decreasing order: 

A5i > AS2 > AS'3 > . . . > ASn > . . . > . (64) 

We call 77„ = ±1 the sign of the cluster-entropy ASr attached (equal in absolute value) to AS'„. Given the threshold 
Q, we define n*(6) as the index of the smallest cluster-entropy larger than 9: ASn-{B) > © > AS'„.(e)+i. The 
truncated entropy p8)) can be rewritten as ^(p, 9) = S(9), where 

n-(e) 



S(9) = V 77„ ASn ■ (65) 



We want to study how S(9) behaves when 9 is made small. In particular, how does the difference es(9) = S](9) — 1](0) 
behave as a function of 9? Is it a smooth function, or does it exhibit large and irregular fluctuations? From a 
mathematical point of view, it is convenient to imagine that N 00. The above question can be formalized as 
whether -i-E(9) converges to some limit value; the normalization factor comes from the fact that we expect the 
cross-entropy to be extensive in the system size N. Depending on the system under consideration, different situations 
can be encountered. 

The most favorable case is when 



^^A5„<oo. (66) 



lim ,^ 

N^oo N 

n>l 

If this condition holds, the difference es(9) can be made arbitrarily small if 9 is small enough. An illustration is 
provided by the one-dimensional Ising model with small correlation length ^ and perfect sampling (B — 00). For this 
model, the sequence of ASn is highly degenerate, and its distinct values are in one-to-one correspondence with the 
integer distances d > 1 between the extremities of the clusters (Fig. [7|). The cluster-entropy AS{d) asymptotically 
decays as exp(— Sd/f), and has multiplity 2*^^^, since each point between the extremities may or may not belong to 
the cluster. We find 

1 ^ A^„ ^ ^ 2''-! exp{-3d/0 , (67) 

n>l d>l 

which converges if f < = Iji^- The calculation above is very similar to the one of Section FlV Al Indeed, when 
the series with general term AS{K) is absolutely convergent, any ordering of the cluster-entropies is possible. In 
particular, one is allowed to sum all the clusters of a given size K as proposed at the beginning of Section HVAl 

What happens when condition (j66p is violated? Again consider the one-dimensional Ising model. For perfect 
sampling, the cancellation property discussed in Section IlII C 21 ensures that -^£(9) has reached its limit AS{d = 1) 



23 



0.1 



n3 

^ 0.01 



0.001 



0.0001 



ID-Ising, N=30 
(1/2)' 



2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 

Block length 1 

FIG. 13: Frequences of the block length £ for the ID-Ising model, with A'^ = 30 spins, ^ — 8.96, and one set of B = 10^ sampled 
configurations. The statistics takes into account the cluster-entropies used to draw Fig. I12f a') only. 



as soon as < AS'(l). In the case of noisy sampling (finite B), the situation is more complex. In the presence of 
noise in the correlations Cki the cluster-entropies with the same distance d between extremities are not degenerate 
any longer. We show in Fig. [T^ a) the value of E as function of Q for a large correlation length ^ compared to ^c, 
and B = 10^ sampled configurations. We observe the appearance of 'packets' of cluster-entropies, located around the 
noiseless values AS{d > 2). The width of a packet depends on the amount of noise due to the sampling, i.e. on the 
number B of sampled configurations. The values of E at the two edges of the packet are very close to one another 
due to the cancellation property. As Q spans the range of cluster-entropies in the packet, E fluctuates. The maximal 
amplitude of the fluctuations seems to weakly increase as we look at packets with smaller and smaller entropies 
(Fig.HHa)). 

We have analyzed the statistics of the signs e„ of the cluters -entropies in (j65|) . Writing the sequence of signs 
V — ivii V^i ■ ■ Oj ^6 consider the blocks j of contiguous and equal signs, and defines their lengths £j. For instance, 
the block lengths corresponding to r/ = (-1-, — , -|-, -|-, -|-, -|-, — , — , — , — , — , +, — , ....) are ii = 2, €2 = 1, ^3 = 4, ^4 = 
5, £5 = 1, .... The histogram of the block-lengths is shown in Fig. [T^ The two main features are: 

• The frequence of £ decreases exponentially when £ <^ N , and is in very good agreement with the exponential 
law (1)^ 

• A large 'structural' block of length £ ^ N \s present. This block corresponds to the N clusters of size K — 2 
(having all sign and the cluster of size K — i with largest entropy (which has the same sign 

We have calculated the correlation between successive block lengths, normalized by the variance of the block length, 



—2 



(68) 

^ - h 

where (•) denotes the average over the blocks j. For the model and the data shown in Figs. [12] and [131 we find 
p ~ .012. Changing the set of sampled configurations does not affect the amplitude of the ratio p, which is always 
found to be about 1%. This ratio coincides with the inverse of the square root of the number of blocks, equal to a few 
thousands. Hence, the analysis is compatible with the absence of any correlation between the lengths of successive 
blocks. The same conclusion is reached with experimental data, e.g. multi-electrode recordings of the activity of a 
neural population 0, [s^ (not shown) . 

The simple statistics sets above suggests the following 'random sign' model, allowing us to deepen our theoretical 
understanding of the behavior of E(0). In the random sign model, the signs rjn are replaced with random variables, 
equal to ± with probabilities ^, and independent from each other. We emphasize that, in E(0) defined in (|55|). the 
signs are deterministic (for given data p). The random sign model is therefore an approximation motivated by the 
statistical analysis above. Assume now that the value chosen for the threshold 8 falls within a packet p including Mp 
clusters. Fluctuations of the order of 

AE - ±Q^/J^p (69) 
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are expected on the entropy. As Q decreases, the size of the packets, Afp, tends to be bigger. Loosely speaking, smaller 
entropies correspond to longer interaction paths, shared by many more clusters. In the case of the one-dimensional 
Ising model, as G decreases, the distance between the extremities of the clusters involved in a packet, d, increases. We 
have J\fp = 2'^-!; hence, AS ^ exp(— 3d/^)'\/2^. We conclude that the error on the entropy tends to zero if ^ < 2^c- 

From the above discussion, it appears that a general, sufficient condition for the amplitude of the fluctuations to 
vanish as —> is 

hm 1 ^(A5„)2 < oo . (70) 

n 

Indeed, if condition ([70| is fulfilled, the sum of the fiuctuations due to all packets corresponding to cluster-entropies 
smaller than 8 is guaranteed to vanish with 8. Hence condition ([70)1 not only ensures that the fiuctuations AS 
attached to the packet 'cut' by 8 vanishes, but also that the error on the entropy, es(8), tends to zero when 8 — 0. 
It is important to realize that the guarantee is of probabilistic nature. Arbitrary large fluctuations are possible (in the 
N oo limit), but are very unlikely. More precisely, within the random sign model, the error is a normal variable, 

e,(8)=AA('o,l J2 (^^")') , (71) 

with a variance vanishing with 8 according to (j70p . The true error is expected to be even smaller than the random 
sign estimate (|71l) . Indeed, packets need not be isolated from each other as in Fig. [TH In the presence of a strong 
sampling noise, or in higher dimension than D = 1, packets will overlap. As a consequence, the number of packets 
'cut' by the threshold and their size will determine the amplitude of AS. Further investigations of those points are 
needed. 

V. ADAPTIVE ALGORITHM FOR THE INVERSE ISING PROBLEM 
A. Procedure to construct and select clusters 

As explained above discarding the cluster-entropies smaller than a threshold 8 is an efficient step against overfitting 
of the sampling noise. In addition, for systems with dilute and strong interactions, we expect that only the clusters 
of neighboring sites on the interaction network will have substantial entropies. These arguments provide a heuristic 
basis for the threshold-based truncation of the expansion ^27\i . 

How can we implement the truncation scheme in practice? The combinatorial explosion of the number of clusters 
of size K among N sites impedes any brute force computation approach, as soon as N is larger than a few tens. Even 
for small-A'' system for which it is feasible, computing ^ 2^ cluster-entropies and, then, discarding most of them does 
not sound like an efficient procedure. 

We propose below an alternative approach, based on a recursive and selective construction of relevant clusters. 
The approach is based on the principle that clusters with large entropies should be compatible with the interaction 
network to be inferred. Suppose that two clusters F and F' have both large entropies, and share most of their spins. 
Then, the union F U F' is a good candidate for a bigger cluster. If the entropy of the union cluster is large, a new 
part of the interaction network will be unveiled. Conversely, if it is small, no new interaction path with respect to 
the one discovered from F and F' separately exists. Hence, combining strongly overlapping clusters should allow us 
to progressively deepen our knowledge of the local structure of the interaction graph. 

The above heuristics is formalized as follows: 

Al. Initial step: build the list of all clusters of size one: Li = {{{) : i = 1,2, .. . ,N}. All the other lists Lk for 
K > 2 are empty. 

A2. Iteration: assume the current size of clusters is -ftr > 1, i.e. Lk is not empty while Lk+i is empty. For every 
pair Fi, F2 in Lk- 

A21. Construction: build F = Fi U F2 

A22. Selection: if F is of size K + 1 and if |AS'r(p)| > 8, then select F and add it to Lk+i. 

A3. Recursion: if at least one cluster has been selected, then add 1 to K, and go to step 2 to pursue the construction 
process. If no cluster has been selected, the construction process is over. 

The first condition in A22 is about the size of F. The union of two clusters of size K has size K + 1 ii and only 
if they have exactly K — I common spins. Fi = (ii, 12 . . . ik-i, x) and F2 = (ii, 12, . . . , ix-i, y) can be merged into 
F = (ii, ^2, • ■ • , iR-i, X, y); the ordering of x, y, and of the ij's is irrelevant here. 
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B. Calculation of A5r(p) 

Step A22 requires the calculation of the cluster-entropy AS'r(p) for each selected cluster F (of size K). In order 
to do so we make use of the formula 

A5r(p) = 5r(p) - (5o)r(p) - ^ A5r'(p) , (72) 

r'cr 
(rvr) 

which can be easily deduced from ((27)) . The procedure is as follows: 

Bl. calculate the subset-entropy '5'r(p) through the minimization of 5/smg(J|p) ([7]) with respect to the fields and 
couplings. The partition function Z[J] is computed as the sum over the 2^ configurations of the spins in F. 

B2. substract the reference entropy (S'o)r(p)- For the mean-field reference entropy, (S'o)r(p) = ^ logdetA/r(p), 
according to formula (PT|) : A/r(p) is the K x K restriction of matrix M(p) to the indices ii, 12, • ■ • , i/f in F. In 
presence of a regularization term equation ([55)) has to be used instead of (1^ to calculate (S'o)r(p)- 

B3. Substract the entropies AS'r'(p) of all the sub-clusters F' of size K' < K, included in F. 

The last step (B3) assumes that the entropies of all the sub-clusters of F are known, i.e. have been computed at 
a previous step in the algorithm. This is true for K' = 2, but not necessarily so for K' > 3. To circumvent this 
difficulty we maintain at all times during the execution of the algorithm the list Laii of all the clusters and of their 
entropies calculated so far; Laii is a larger list than the one of the selected clusters (union of all Lx)- The procedure 
to compute AS'r(p) is then: 

BO. build the list L of all the sub-clusters F' in F not already present in Laii- For each F' G L, starting from the 
smallest sub-cluster and ending up with the largest one, run steps Bl, B2, B3 to obtain AS'r'(p), and add F' 
and its entropy to the list Laii- 

The ordering of L ensures that all the sub-clusters of F' required to calculated its entropy are in Laii when step B3 
is executed. 



C. Calculation of the cross-entropy, couplings and fields 

Once the construction process is finished, the list Lgei = L1UL2UL3U. . - ^Lk^^^ of all selected clusters is available. 
Here, K„iax is the size of the largest cluster selected by the construction procedure. We then 

CI. estimate the cross-entropy through 

5(p) = 5o(p) + ^^r(p) . (73) 

Next we need to estimate the values of the fields and of the couplings, solution to the inverse Ising problem. One 
possibility would be to use recursion relations similar to (j72[) for A/ii^r(p) and AJy^r(p), that is, the contributions to, 
respectively, the field hi and the coupling J.^ coming from the cluster F. Next we could sum up those contributions 
over the clusters included in Lgei- However, to save memory space, it is possible to resort to the following, alternative 
procedure: 

C2. define the 'multiplicities' ray of the subsets F through: 

C21. let Lsub be the list of all clusters in Lsei and of all their subsets. Initialize mr = for every F e Lsub- 

C22. For each F e L^eU and for each F' C F, add (-1)^"-^' (see ([29])) to mr' , where K, K' are the sizes 
of, respectively, F,F'. The sub-clusters F' = F must be taken into account in the addition process. 

C3. estimate the fields and the couplings through 

ht{v) = {hn)i[p) + ^ mr {h^.r{p) ~ (^o)2,r(p)) , 
reL,„6:(i)cr 

Jtjiv) = {Ja)r]{p)+ Y "^r (%r(p) - (Jo)jj,r(p)) • (74) 
reL,„6:(*,i)cr 
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The fields hix and the coupUngs J^-^r in step C3 above are the ones obtained through the minimization of 5'/si„g(J|p) 
over J — {hi j', Jij^r} in step Bl. The fields (/io)i and the couplings (Jo)ij are (minus) the derivatives of the reference 
entropy S'o(p) with respect to pi and pij, see formulas ([22l) . The fields (/io)i,r and the couplings (Jo)ii,r are their 
counterparts for the subset T only, i.e. the derivatives of (S'o)r(p); their expressions are given by (|22p again, upon 
substitution of the N x N matrix M(p) with the K x K matrix Mr(p) restricted to the K elements of F only. 

D. Pseudo-code of the algorithm 

We now give the pseudo-code useful for the implementation of the procedures above. To improve the readability 
the code is broken into several parts. 

We start with Algorithm 1, which computes the cross-entropy and the reference entropy for a subset T. The energy 
function Rising is defined in ([S]). The minimization over J can be done using standard numerical algorithms for 
convex functions. A speed-up is generally obtained when we start with Jm_Fj the value of the interaction parameters 
obtained from the MF approximation ([2^ . as an initial guess for the value of J [45]. In the absence of regularization, 
the parameter 7 is set to 0. It is straightforward to change the pseudo-code to introduce the Li-regularization instead 
of the L2-norm, see formulas (fT4)) . 



Algorithm 1 Computation of entropy 5r(p) — (S'o)r(p) 
Require: F (of size K), data p, regularization parameter 7 

Computation of Sr •' 

Define (S/s,ng)p[J| p] log ^ exp ( - [cr| J]) - ^ /li Pi - ^ J^3 Pij + 'y'^ jfj Pi ~ Pi) Pj ~ Pj) , 

\cre{o,i}^ / ier i<jsr i<j 

where 

J — {hi, Jij} is of dimension ^K{K + 1). 
S'r(p) mjn (S/sing )p [J |p] 

Computation o/(5'o)r-' 

Mr ^ K X K matrix with elements (Mr)i, = p-j "P'^j with i, ? G F 

^/PiC'--Pi)Pj{^-Pj) 

(S'o)r(p) ^ llogdetMr if 7 = 0, or use formula (|36|) if 7 > 0. 
Output: 5'r(p) - (So)r(p) 



Algorithm 2 calculates the entropy AS'r of the cluster F and maintains the list Laii of all cluster-entropies computed 
so far. It calls Algorithm 1 as a subroutine. 



Algorithm 2 Computation of cluster-entropy AS'r(p) 
Require: F (of size K), data p, list Laii — {F', AS'r' (p)} of known cluster-entropies. 

L -f- {F' : F' C F and F' ^ Laii} (ordered in increasing sizes) 
for F' G L do 

AS'r'(p) 5'r(p) — (So)r(p) — ASr"(p) using list Laii of cluster-entropies calculated so far 

r"cr' 
(r'Vr') 

update Laii <- Laii U {F', A5'r'(p)} 
end for 

Output: A5'r(p) and Latt 



We can now give the core part of the procedure, which produces the list of selected clusters: 

Algorithm 4 calculates the estimates for the total cross-entropy, and for the interaction parameters once the list 
of selected clusters Lsei has been obtained. It requires Algorithms 1 and 2; function {Sismg)r and matrix Mr are 
defined in the pseudo-code of Algorithm 1. 
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Algorithm 3 Adaptive cluster algorithm for the inverse Ising problem 
Require: data p, threshold O 

Li^{(i):i = l,2,...,iV} 

Lsel ^ 

K ^ 1 

while Lk is not empty 

Lsel ^ Lsel U Lk 

K ^ K+1 
Lk^D 

for Ti.Tz e Lk-1 do 
r ^ Ti U Ta 

if r is of size K -2 and if lA5'r(p)l < 6, then Lk Lk V T 
end for 
end while 

Output: list Lsel of selected clusters 



Algorithm 4 Estimates for the cross-entropy and for the interaction parameters 
Require: data p, list Lsei of selected clusters 

Computation of cross-entropy S: 

compute S'o(p) using formula H21I) or p3p 

5r(p)^So(p)+ ^^r(p) 

Computation of fields and couplings J = {hi, Jij}: 
compute {(/io)i(p)i (>^o)ij(p)} using formula 

Lsub ^ {r' c r : r G Lsei} 

for r' G Lsub do 

mr' ~ (—1)''"' ', where |r|,|r'| are the sizes of r,r'. 

{/ii(p), Jjj(p)} ^ argniin [Sist^g] j.[J\p] 

compute {{ho)i,r{p), iJo)ij,r{p)} using formula (p2)) . with Mr replacing M. 
end for 

compute {fti(p), >/ij(p)} using formula (|74p 
Output: S'(P),{/1,(P),J'«.(P)} 



VI. APPLICATIONS 

In this Section we report the results of our algorithm when applied to data generated from Ising models with diverse 
interaction structures and various numbers B of sampled configurations. We define: 

• the number N^iu of clusters generated by the algorithm and the size K„^ax of the largest clusters. 

• the average error on the inferred couplings and fields: 

Here, J^^^ and /i*"-' denote the values of, respectively, the inferred couplings and fields, while and hi are the 
values of the couplings and fields in the model used to generate the data. 

• The error bars Shi and 6Jki on the inferred couplings and fields, resulting from the finite sampling. Those 
statistical fluctuations are asymptotically given by the inverse of the susceptibility matrix of the cross-entropy 
Si sing, see equation (fT7|) . The entries of x can be calculated from a Monte Carlo simulation, to estimate the 
multi-spins correlations. In practice, a good approximation of x can already be obtained from the empirical 
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average over the B configurations in the sampling set. This procedure avoids the use of the Monte Carlo. In 
the presence of a L2-regularization ([T^ . 7Pfc(l — Pk)pii^ — Pi) is added to the diagonal element Xki,ki of the 
susceptibility matrix, before the inversion is performed. Hence, all the eigenvalues are strictly positive and the 
inverse is well defined. The inversion of x can be done with standard linear algebra routines. 

Inferred couplings are called 'reliable' when their absolute value is larger than three times their statistical 
error-bar: \Jki \ > 3 SJki- 

• The reconstructed observables, p^'^'^ and c^J"^, which we compare to the data, pi and Cij. Those reconstructed 
averages are obtained using Monte Carlo simulations of the Ising model with the inferred fields, h"'''^ , and 
couplings, Jij''^ ■ For those simulations the number of sampled configurations is chosen to be much larger than 
B, e.g. 100 -B, to minimize the uncertainty on the reconstructed averages. 

• the relative errors on the reconstructed averages pi and connected correlations Cij , with respect to their statistical 
fiuctuations due to finite sampling: 

where the denominators in (|76p measure the typical fiuctuations of the data expected at thermal equilibrium, 
see ([T5)) . and 

5cki = Spki + PkSpi + piSpk . (77) 

If not explicitly stated otherwise we start from the value = 1 for the threshold and run the algorithm several times, 
dividing the threshold by 1.01 after each execution. The algorithm is stopped when both errors ep and ec are close 
to 1. We call 9* the final value of the threshold corresponding to this criterion. Unless explicitly stated otherwise a 
L2-regularization term (|13p is present, with 7 — 1/(10 Bp^ (1— p)^), where p is the average value of the pi's (Appendix 
E]) . As explained in Section III Bl the regularization term is important in case of undersampling and guarantees the 
convergence of the numerical minimization of Sismg- 

A. Independent Spin Model 

It is instructive to run first the algorithm on the Independent Spins model, where each spin has a probability pi 
to be 1, 1 — Pi to be 0, independently of the other variables. Due to the noise in the sampling (finite value of B), 
the connected correlations Cij are not equal to zero. Figure [Ml shows the outcome for a system of size N = 40, as 
a function of the threshold Q. The errors of reconstruction, Cp and ec, are already smaller than one for the initial 
threshold value 9* = 1. For this value of the threshold, cluster of size one only are selected. In other words, the 
interaction network Jq, calculated from the reference entropy So = Smf alone, is already overfitting the data as it 
attempts to reproduce the correlations due to statistical fiuctuations. For smaller thresholds 9 contributions from 
clusters of size K > 2 allow for an even more precise reproduction of the data . 

The histogram of the inferred couplings, JjJ'^ , is shown in Fig. 1151 It is centered around zero, and is approximately 
Gaussian. The standard deviation of the distribution is compatible with the statistical error bar on couplings SJij 
([TTI) averaged on all the couplings. For the particular case of Fig. [HI 815 of the 820 inferred couplings are away from 
zero by less than three error bars, and are, therefore, classified as unreliable. This result is compatible with the fact 
that the non-zero couplings are the consequence of overfitting and do not reflect any real interactions. 

Another possibility to avoid overfitting in this case is to apply the cluster expansion to the entropy S in the absence 
of reference entropy {Sq = 0). We find that, for 9* = 1, the reconstruction errors are Cp = 0.07 and Ec = 0.8. 
Therefore only one-spin clusters are taken into account, and all couplings are equal to zero exactly (since Jq — 0). 

B. Unidimensional Ising model 

We now test the algorithm on the unidimensional Ising model with first-neighbor interactions, Ji,i+i = J, and 
uniform fields, = h. The model is placed on a ring with N sites (periodic boundary conditions). Data p can be 
computed exactly (Appendix |F| or through an average over B configurations, sampled by Monte Carlo simulations. 
We compare the performance of the inference procedure for various values of B and two values of J, h, corresponding 
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FIG. 14: Performance of the algorithm as a function of Q for the Independent Spin model: (a) errors ep and ec; (b) cross- 
entropy S The entropy of the model in absence of sampHng noise is ~ 4.182071; (c) size Kmax of largest clusters; (d) number 
Nciu of clusters. Panels (e) and (f ) show the reconstructed pi and Cij vs. their values in the data. Error bars on pi and Cij are 
calculated through (|77[). Data were obtained by sampling B = 10^ configurations of the N = 40 spins, with spin dependent 
means Pi equal to the average activity of the neurons in j^]. The optimal threshold Q* is already obtained with clusters of size 
K =1. 




Couplings J 

FIG. 15: Histogram of the inferred couplings for the Independent Spin model of Fig. 1141 and O* = 1.815 of the 820 inferred 
couplings unreliable (in gray), because compatible with zero within three standard deviations 



to the correlation lengths ^ ~ 1 {h = —5, J = 4) and ~ 9 {h — —5.95, J = 6). These values are, respectively, 
smaller than — 4.3, the correlation length below which the cross-entropy expansion (I65p is absolutely convergent, 
and larger than 2^c — 8.6, above which condition (|7n|) is violated, see Section HVD I 



1. Accuracy of the cluster expansion as a function of the threshold: errors on the entropy, couplings and fields 

We start with the small-^ case. Figure fTBT top) shows eg, the absolute value of the difference between the cross- 
entropy ^(p, 6) (pSj) and the entropy of the model for perfect sampling, and the errors ej and eh, for various values of 
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FIG. 16: Errors on the entropy (es), the couphngs (ej), and the fields (e/j) vs. threshold 6 for the unidimensional Ising model 
with ^ = 1 (top) and ^ = 9 (bottom). Full fines correspond to large sampfing noise {B = 10^ for C = 1, B = 10^ for C = 9), 
dashed line to weak sampling noise {B = 10'^ for ^ = 1, B = 10^ for ^ = 9). The size is A'' = 30. 



B. We observe that eg sharply decreases around 6i = 0.05, that is, the entropy of nearest-neighbor clusters AS'(j j+ij; 
discarding all entropies smaller than this value would be exact in the perfect sampling case (Section IIII C 2|) . As Q 
is decreased, es exhibits fluctuations centered around a discrete sequence of threshold values, 0d, with d > 2. As 
explained in Section IIII C 21 these values correspond to the cluster-entropies A5'(i i^^^) in the absence of noise, see 
identity (j45p . Fluctuations spread over a small window around Qd- They correspond to imperfect cancellations 
of 'packets' of entropies whose associated clusters share the same interaction path with length L = 2d (Section 
lIVPp . Since the correlation length ^ is small, and, therefore, the cross entropy expansion is absolutely convergent, 
the magnitude of the fluctuations quickly decreases with d. For B = 10^ two bursts of fluctuations are visible 
(corresponding to d = 2, 3). For B — 10^, fluctuations are smaller and spread over narrower windows; only the d = 2 
burst can be observed. In between two bursts of fluctuations, £5 reaches a plateau. Note that the value of es on the 
plateau is not zero due to the sampling noise (finite B). 

The errors on the inferred couplings and fields have the same behaviour as es (Fig. ll6f top)). Their magnitude are 
comparable to the expected statistical fluctuations calculated from the 4-spin correlations (inverse susceptibility in 
(fTTl) '). which decrease as B~^/^. 

We now test our algorithm on the unidimensional Ising model with a large correlation length, ^ = 9. The errors 
^s,^Ji^h are shown in Fig. [rGT bottom) as functions of Q. The correlation length of the model is larger than 2^c, 
at which the series of the squared cluster-entropies is divergent. Therefore, we expect large fluctuations oi es, 
corresponding to packets of clusters with the same interaction path (Section IIVD[) . Figure [T51 shows, indeed, that 
fluctuations are much larger for ^ = 9 (bottom) than for ^ = 1 (top). Furthermore, fluctuations do not decrease much 
in amplitude as Q is decreased. In the case of severe undersampling {B — 1000) the notion of bursts of fluctuations 
separated by plateaus is blurred out. The distributions of cluster-entropies in a packet is so wide that it overlaps with 
the distributions of entropies associated to the neighbouring packets. As for the previous case the magnitude of ej, eh 
are comparable to the expected statistical fluctuations calculated from ([T7|. 
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FIG. 17: Performance of the algorithm on the unidimensional Ising model with ^ = 1, B = 10^ (full-bold line) B — 10^ (dashed 
line). For both values of B the optimal threshold 0.0007 < O* < 0.003 is reached with clusters of size K — 3 and length L = 6. 
The reconstruction of data pi and Cij is shown only for the B = 10^ case (large sampling noise). Error bars on pi and Cij are 
computed from (|77p . 




FIG. 18: Performance of the algorithm on the unidimensional Ising model with ^ = 9, B — 10^ (full-bold line) and B — 10* 
(dashed line). For the largest sampling noise case {B = 10^) at the threshold 9* ~ .104, 89 clusters of size K = 2, 92 clusters 
of size = 3, 35 clusters of size = 4, and 2 clusters of size K = 5 are selected. The reconstruction of data Pi and dj is 
shown for the B = 10"^ case. Error bars on pi and dj are computed from (I77p . 



2. Quality of reconstruction and choice of the threshold O* 

We now study the reconstruction errors ec and tp as functions of 0, for the small and large correlation lengths and 
for the weak and strong sampling noises. We show in Fig. [T7]the errors ec and tp, as well as the maximal size, K^axi 
and the number, Ndm of clusters vs. the threshold 9 in the case of a small correlation length, ^ = 1 . The threshold 
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at which both Cc and Cp are close to 1 can be chosen in the range 0.0007 < 9* < 0.003, for which all the clusters 
of lengths L < 4 are processed. It is possible to check in Fig. [TST top) that this threshold value gives the minimum 
values of eg, ej, th ■ Contrary to the case of perfect sampling, it is not sufficient to take into account clusters with 
contour length L = 2 only. The selected clusters correspond to three groups of = 30 clusters each; the first group 
gathers the clusters + 1) {L = 2), the second one, the clusters + 2) {L = 4) and the third one, the clusters 
+ l,i + 2) {L ~ 4). In Fig. [IT] we show the reconstructed averages pi and correlations Cy, at O* and for the 
largest sampling noise case B = 10^, vs. their values in the data . The agreement is very good, and falls within the 
statistical fiuctuations expected at equilibrium for the Ising model, given in (|77p . We stress that the optimal value of 
the threshold and the maximal size of selected clusters depend on the particular realization of the data p, and can 
vary from sample to sample. By further decreasing the threshold O below Q* , the reconstruction errors and Cc 
decrease to values smaller than one (Fig. [T71) . This regime corresponds to an overffiting of the data, as the errors on 
the couplings, ej, and on the fields, eh, cease to decrease (Fig. [TU^top)). 

Results for the case of a larger correlation length (^ = 9) with B — 10^ and B — 10'^ sampling configurations are 
reported in Fig. [181 For the poor sampling case {B — 10^) plateaus are not present any longer, but a good inference 
is still obtained for a value 8* of the threshold, which, as in the ^ = 1 case, corresponds to the summation of all the 
clusters with contour length L = 4. This finding supports the discussion of Section IIV CI the contour length required 
for a good inference is largely independent of the correlation length. However, in the poor sampling case, finding the 
right value for Q* is harder for larger ^ due to the mixing of packets. 

3. Quality of the inference: histograms of couplings 

To better understand the quality of the inference we plot in Fig. [12] (up and middle panels) the histogram of the 
inferred couplings Jj^-^ at the threshold 8*. The distribution is bimodal: a Gaussian-like peak centered in J™f = 
and a smaller distribution around J"'-' = 4. The two sub-distributions are separated by a wide gap. The inference 
algorithm makes no classification error: the sub-distribution centered around J™^ = contains all the pairs 
such that Jij — 0, and the one around J™-'' — 4 includes all the pairs of nearest neighbours {i, i + All the couplings 
centered around zero are unreliable. Moreover the standard deviation of the distribution of the couplings around the 
zero value (equal to the minimal value of ej reached on the plateau in Fig.[T6|) agrees with the statistical fluctuations 
([T7]); all the couplings around zero are therefore unreliable. The structure of the interaction network is perfectly 
recovered. 

We show the histogram of couplings for 8 > 8*, i.e. > 1, in Fig. [19] (bottom); the structure of the interaction 
network is still perfectly recovered but the values of the positive inferred couplings is less accurate. The histogram 
of inferred couplings for the large correlation length , = 9, is shown in Fig. [50] Even when the sampling noise is 
large, a good separation of the two sub-distributions corresponding to interacting and non-interacting pairs of spins is 
achieved for large values of the threshold 8*, and the couplings are correctly inferred (up to the expected statistical 
fluctuations). 

C. Regular bidimensional grid 

We now analyze the performance of the algorithm on bidimensional grids of different sizes, N — N' x N' . Nearest 
neighbors on the grid interact through the coupling J. The value of J ~ 1.778 is chosen to make the grid critical in 
the thermodynamical limit, iV' — >■ oo [13, H^. Hence, the system is at the paramagnetic/ferromagnetic critical point, 
and the correlation length ^' diverges with N' . 

We have described in Section IIV Al and Fig. [5] the partial cancellation of the cluster-entropies for a small bidimen- 
sional grid (N' = 3), and no sampling noise. Due to this cancellation property, taking into account clusters of contour 
length L < 4 was sufficient to obtain a very accurate approximation to the cross-entropy. Hereafter, we show that 
this result is not affected by the presence of sampling noise. Furthermore, we will see that the size of the clusters 
necessary for a good inference of the interactions remains rather constant when the grid size is increased from N' = 7 
to N' — 20, and is thus, as in Section fVl Bl largely independent of the correlation length ^. 

1. The small 3x3 grid revisited: influence of the sampling noise 

We start with the 3x3 grid of Section IIV Al for which the summations of all clusters up to size K — 9 gives 
the exact solution of the inverse problem, and all 1- and 2-point averages can be calculated exactly in the perfect 
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FIG. 19: Histograms of the inferred couplings JlJ^ for the unidimensional Ising model with ^ = 1. Couplings equal to J = 4 
and J = in the model are shown in, respectively, black and gray. Gray couplings are unreliable, as they differ from zero by 
less than three standard deviations. Values of B and O are indicated on the figure. The bin width is AJ = .08. 
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FIG. 20: Same as Fig. 1191 but with ^ = 9 instead of ^ = 1. Couplings equal to J = 6 and J = in the model are shown in, 
respectively, black and gray. Gray couplings are unreliable, as they differ from zero by less than three standard deviations. 



sampling case. The reader is kindly referred to Fig. |6] and to the related discussion. Figure [2T] shows the errors 
on the entropy, the couplings, the fields, the reconstructed 2- and 1-point averages, and the size and the number of 
clusters as functions of Q. For a given amount of sampling noise (set by the value of B), the errors es,£j,€h follow 
their perfect-sampling counterparts, until a threshold value Qsat, and they saturate for Q < Qsat- The saturations 
are interrupted by fluctuations due to the imperfect cancellations of clusters within a packet (Section IIVD|) . The 
saturation values of ej and eh decrease with increasing B, and are compatible with the expected statistical fluctuations 
SJij given by (|17[) . The value of Qsat approximately coincides with the threshold Q*, at which both ec and ep are 
~ 1 (Fig. [21]) . For B = 4500, only clusters of size K — 2 are taken into account at O*. For B — 10^, clusters made 
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FIG. 21: Errors e^, eh, tc, tp, size Kmax, and number Kdu of clusters vs. Q for a 3 x 3 grid with J = 1.778. The dashed line 
corresponds to B = 4500 sampled configurations, the full line to B = lO'^; the perfect sampling curves are shown with dotted 
lines in the top three panels (same data as Fig. |6]). The values of Q* are shown with vertical lines: 0* — .0017 (full, B — 10^), 
and .007 (dashed, B = 4500). At 9* 21 clusters {B = 4500) and 49 clusters {B = 10^) are selected 
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FIG. 22: Error ej for a bidimensional grid A''' x A'^', with A'^' = 3, 7, 12, 20, and for B = 4500 sampled configurations. 
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FIG. 23: Reconstruction errors tc. (squares) and tp (diamonds) for grids of sizes 7 x 7, 9 x 9,12 x 12 and 20 x 20. Optimal 
thresholds O* are located by arrows. Other possible choices for Q* , for instance O* = .024 for the 7x7 grid are equally 
possible. 



of spins on the elementary squares of Fig. [5] and of size up to A' = 4 are selected, e.g. 1, 5 or 1, 2, 4, 5 with L = 4 in 
Fig. [SJ while clusters such as (1, 2, 3) or (1, 3), which have the same contour length but are not on elementary squares, 
are discarded. At 8* the histogram of the inferred couplings is made of two far apart sub-distributions (not shown): 
the first one corresponds to the 12 non zero couplings and is centered around J™-^ ~ 1.8, and the second one, peaked 
around J*"-^ = 0, contains the remaining 24 pairs of sites. Hence, the structure of the interaction graph is correctly 
found back. 



2. Study of larger critical grids 

We now turn to larger grids N' x A'"', where A'^' ranges between 7 and 20. Data are calculated from B = 4500 
configurations sampled through a Monte Carlo simulation. The error ej on the couplings is shown in Fig. 1221 As Q 
decreases, ej saturates to a value close to the average of the expected statistical error, (5Jy, which lies between .1 
and .2. Saturation begins at large values of the threshold, even when the linear size N' of the grid is increased. The 
asymptotic values depends strongly on N due to the non periodic boundary conditions (STj . 

The reconstructions errors ep and Ec are shown in Fig. 1231 The maximal size of the clusters at the optimal threshold 
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FIG. 24: Histograms of inferred couplings for bidimensional grids of different sizes. Couplings equal to J = 1.778 and in 
the model are shown in, respectively, black and gray; the integral of each sub-distribution is normalized to one by hand. The 
values of Q* for each size are shown in Fig. 1231 



Q* is bounded [Kmax < 4), as the correlation length ^ diverges with N' . As a consequence, the running time of 
the algorithm increases linearly with N . For threshold values smaller than O*, and decrease. The 7x7 grid 
in Fig. [231 provides a clear illustration of data overfitting. Keeping B fixed while N' and A'' increase make the data 
effectively more and more noisy. This is the reason why the value of 9* slightly increases with N . 

The histograms of the inferred couplings at the threshold Q* are shown in Fig. [24l The structure of the grid is 
perfectly reconstructed for all sizes N' . We find that all the inferred couplings in the sub-distribution centered around 
zero are smaller (in absolute value) than three times their standard deviation (corresponding to the asymptotic value 
of Fig. [22t and are, therefore, unreliable. 



D. Randomly diluted bidimensional grid 



We now remove a fraction 1 — p of the couplings on the grid, independently and at random. Our goal is to study how 
the algorithm performs on such disordered systems, at the phase transition and in the low temperature phase. We 
will compare the performance with another low complexity algorithm, the regularized logistic regression algorithm, 
guaranteed to perform well at high temperature and to fail at low temperature [l5j . To compare with the numerical 
experiments of [l6| . we have generated 7x7 bidimensional grids, and keep each bond with probability p — .7. The 
remaining bonds are all equal to J [ssl ]. We generate, for each value of J ranging from 0.4 to 4.4, eight randomly 
diluted grids. For each grid we calculate the data p by sampling over B — 4500 configurations generated by a Monte 
Carlo dynamics. 



1. Inference of the network structure from the mean field entropy Smf 

Our first task consists, as in pli|, in reconstructing the structure of the interaction graph only. We do not want 
to accurately determine the value of the coupling constants Jij , but only if it is positive or null. This task is easier 
than the precise inference of the couplings, and we will first handle by approximating the cross-entropy S with the 
reference entropy 5*0 — Smf only. Equivalently, we choose O* to be large enough that no cluster is selected by our 
algorithm. We compute the mean-field couplings, {Jo)ij, and, for each pair (i, j), and decide that a bond is present if 
{Jo)ij > absent otherwise. The performances of this simple, Mean-Field algorithm are shown in Fig. [25l We say 
that the neighborhood of a vertex i is reconstructed if the sets of its neighbors j { Jij ^ 0) and of its non-neighbors 
(Jij = 0) are correctly inferred. In Fig. [55] (top panel) we report the fraction Qsucc of the neighborhoods which are 
reconstructed (straight line), as a function of the coupling strength J. We compare the performance of the mean-field 
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FIG. 25: Probabilities that a neighborhoods is reconstructed (top) and that a bond is inferred (bottom) as functions of 
the coupling J for a bidimensional random 7x7 grid density p = .7 and for various values of the numbers B of sampled 
configurations. Error bars are calculated from the standard deviations over eight random grids. Probabilities were obtained 
from the simple Mean-Field algorithm (So = Smf, = 1). The value of Qsucc (top) is compared to the performances of the 
pseudo-likelihood algorithm (Rlr) of [Tsl . [igI ] in the top panel. 



algorithm with the pseudo- likelihood algorithm of [3l in Fig. ISSl right). Contrary to the pseudo- likelihood algorithm 
case, the neighborhoods are perfectly reconstructed at the phase transition. Furthermore, Qsucc remains large in the 
ferromagnetic phase: for instance, for J = 3.2, more than 80% of neighborhoods are perfectly inferred. For very large 
J (low temperatures) the average pi are too close to in the down state and to 1 in the up state. Most of the sampled 
configurations coincide with one of the two ground states, and the inference is difhcult. Fig. [SSltop panel) shows the 
increase of Qsucc resulting from a ten-fold increase of B. 

Another measure of the performances is shown in Fig. [25l bottom panel). We plot Psucc, the fraction of bonds in 
the grid correctly predicted to exist, averaged over the data realizations, as a function of the coupling strength J. For 
J = 3.2 more than 99.78% (respectively 99.96% ) of the bonds are correctly predicted with B = 4500 (respectively 
B — 45000) configurations. For even larger values of the coupling constant, J = 4, more than 95 % (respectively 
99%) of the bonds are correctly predicted with B — 4500 (respectively B = 45000) with the Mean-Field algorithm. 

In Section IVI D 31 we show that the probability of success increases in the ferromagnetic phase, when using our 
algorithm with a well-chosen threshold Q rather than the simple Mean-Field procedure (9 = 1), e.g. all neighborhoods 
are perfectly reconstructed for J = 3.2 {Qsucc = !)• 



2. Is thermalization relevant to the inference at low temperatures? 

The diluted bidimensional grid, with a fraction p — .7 of non-zero bonds, undergoes a transition from a paramagnetic 
to a ferromagnetic phase at the value Jcrit{p = .7) ~ 2.8 (vertical line in Fig. [25|) in the infinite size limit [Uli^l. In 
the ferromagnetic phase, J > Jcrit{p — 0.7), and for small bidimensional grids, two competing 'states' coexist: the 
'down' state, where most spins are 0, and the up state, where most spins are equal to 1. The system 'jumps' from 
one state to the other, as shown by the time-dependence of the average activity, 

1 ^ 

/^w = ]^E^»W' (78) 

1=1 

where t is the Monte Carlo time. Figure [26l a) shows that the two states are equally sampled on a 9 x 9 grid, with 
Na = 10,000 single spin-flip attempts with the Metropolis rule in between two sampled configurations (the results of 
Fig. [25] on the 7x7 grid were obtained with the same value of Na). To investigate the performance of the algorithm 
when the two states are not well sampled we have studied a 9 x 9 grid, with Na = 100 and Na = 1, 000. For Na = 100, 
fig. I^STb) shows that few transitions occur, and that the two states will likely not be weighted equally. On larger 
12 X 12 grids no jump occurs, even with A''^ = 1, 000 spin-flip attempts (Fig. I^c)). The values of the spin averages 
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FIG. 26: (a,b,c). Average activity /i (|78p of the B = 4500 sampled configurations for three grid sizes and numbers of spin-fiip 
attempts in between two samplings, (d) Probability of success for J > 2.8, for the same data as in (a,b,c) and for the 7x7 
grid of Fig. ^ 
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FIG. 27: Performances of the inference algorithm for the 7x7 randomly diluted grid, with J = 3.2, B = 4500, Na ~ 10"". 
Left: Relative errors ep, ec as functions of the threshold (top) and comparison of reconstructed and data correlations at Q* 
(bottom). Middle and Right: comparison of the inferred and true fields (top) and histograms of inferred couplings (bottom) for 
our cluster algorithm and the mean field procedure. Color code for the histograms: brown/gray: unreliable couplings (which 
also correspond to zero couplings in the true network), black : reliable couplings. The two sub-distributions are normalized 
separately. 



Pi will strongly vary between the three cases above: pi ~ .5 in the mixed case (a), pi ~ .7 in the particular realization 
(b) of the partially mixed case, and pi close to zero in case (c). Remarkably the probability of success of the algorithm 
is not sensitive to the nature of the mixing. Fig. I26f d) shows, indeed, that the reconstruction performances do not 
significantly decrease even in the partially and non mixed cases compared to the fully thermalized case. 



3. Inference of the couplings and reconstruction of 1- and 2-point averages with the cluster expansion 



It is harder to determine the values of the fields and of the couplings and to reconstruct the frequencies than to infer 
the structure of the interaction graph alone. To this aim the minimization of the MF entropy 5^^^ is generally not 
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FIG. 28: Performance of the inference algorithm in the case of poor mixing, for a 7 x 7 randomly diluted grid, with J = 3.2, 
B = 4500, Na = 10^ Top & left: relative errors ep, ec as functions of the threshold O. Bottom & left: reconstructed pi and 
Cij versus their data values. Right: Inferred fields ft™-' vs. their true values for O* = 0.05 (top) and histogram of the inferred 
couplings J'"^ (bottom). As usual, unreliable couplings, which correspond to zero couplings in the true network, are depicted 
in gray, and reliable couplings in black. 



sufficient, and the cluster expansion of 5 — 5**^^ has to be carried out. In Fig. [27] we show the relative errors Cp and 
Cc in the reconstruction of one- and two-site frequencies as a function of the threshold 9 for the same 7x7 randomly 
diluted grid as in Section lVID 11 We also compare the fields and couplings inferred with our cluster algorithm (middle 
panels in Fig. [27l 8* — .05) to the ones found with the simple MF procedure (right panels in Fig. [27l large value 
of Q). Note that the small error done in the graph learning for J = 3.2 {Psucc = 0.997) can be avoided when the 
threshold value is optimized for each data realization. 

We show in Fig.[28]the performances of the algorithm in the case of poor mixing, when the two states are not equally 
sampled. For the particular realization corresponding to Fig. 1281 the frequencies are pi ~ 0.3 instead of pi ~ 0.5. In 
spite of the poor mixing the inference of the fields and couplings is as accurate as in the case of well-mixed sampling. 
The difference between the fields corresponding to the apparent frequencies pi ~ 0.3 and the true one {pi — .5) are, 
indeed, smaller than the statistical uncertainty on the fields due to the limited sampling (finite value of B). The 
reason is that, near a critical point, a small variation in the field is sufficient to produce a large change in the average 
values of the spins. 

E. Erdos-Renyi random graphs 

In this Section we report the results of our inference algorithm when applied to disordered Ising models on random 
graphs. The random networks are generated from the Erdos-Renyi ensemble, where M = ^ N edges are drawn, 
uniformly and at random, between N points. Parameter d is the average degree of a vertex on the network. 

Figure UHl shows the outcome of the algorithm when data are generated from an Erdos-Renyi model of connectivity 
d = 10. On the selected bonds (i,j) the couplings Jij were chosen uniformly at random in [—3; 3]. All other 
couplings Jij were set to zero, and the fields were hi = —1. Values of the parameters are such that the system 
is in the paramagnetic phase (in the thermodynamic limit). Panel A shows the inference with good sampling (the 
data are obtained by averaging over B = 10^ Monte Carlo configurations), while Panel B shows the inference 
with poor sampling {B — 10'^). At G* the data are reconstructed within the expected statistical fluctuations and, 
correspondingly, couplings are found back within the statistical error bars SJij. In the case of a large sampling noise 
case B = 10^, the statistical fluctuations SJij are so large that most of the inferred couplings are unreliable. The 
inference of the complete network is thus not possible. The maximal size of clusters at 0* increases with the average 
degree (Section rVIF[) : we find Kmax = 9 and K^ax = 7 for, respectively, B = 10^ and B = 10^. 

Fig. 1301 shows the outcome of the algorithm on an Erdos-Renyi random graph with a smaller connectivity, d — 5, 
and for values of the couplings Jij chosen uniformly at random in [—4; 4]. The fields are set to hi ^ ^j{^i) Jij^ 
in such a way that the corresponding fields in spin variable ±1 vanish. This is an example of a smaller connectivity 
system in the spin-glass phase. We have studied the performance of the algorithm as a function of the threshold 0, 
by varying the system size N from 50 to 200 and the number of sampled configurations from B = 1000 to 10000. The 
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FIG. 29: Outcome of the inference algorithm for an Erdos-Renyi random graph with A'' = 50 spins, connectivity d = 10, and 
with B = 10^ (a) and B = 10^ (b) sampled configurations. For each value of B we show the errors €p, €c vs. 6, the inferred vs. 
data values of the correlations Cij, and of the couplings Jij. A few large error bars over Jij (calculated from x~^) shown. 
Values of Jij were chosen uniformly at random in [—3; 3], and fields were set to hi = —1. 



algorithm is able to reach Ec = 1 at large thresholds, with a small number of selected clusters, e.g. Ndu < 1000 and 
Kmax = 7 for TV = 100. The threshold Q* for which ec — 1 corresponds to the beginning of the plateau for ej. For 
smaller thresholds Cc decreases and data are overfitted. The height of the plateau for ej coincides with the calculated 
statistical error SJ; it scales as 1/VB and does not strongly depend on N. 

F. Computational Time 

For a given value Q of the threshold the computational time can be estimated through 

time- NciuiK) 2^ , (79) 

K=l 

where Nciu{K) is the number of selected clusters of size K , and 2^ is the number of operations necessary to calculate 
exactly the partition function of a sub-system of size K . As the number of selected clusters depends on the interaction 
graph, the computational time is sensitive to the structure of the interaction graph, while it does not depend too 
much on the correlation length of the system. For instance, the number of processed clusters and the computational 
time for Erdos-Renyi graphs is larger for the connectivity d — \Q than for d = 5. 

Moreover, as the sampling noise increases (the value of B is made smaller), so does the threshold value Q* . As less 
precision is needed in the reconstructed frequencies and correlations, the size of the selected clusters is reduced. As a 
consequence, the computational time is reduced. Figure [3T] illustrates this statement for Erdos-Renyi random graphs: 
the running time increases with the quality of the sampling, i.e. increases with the value of B. In some very noisy 
cases, however, large size clusters which are due only to the noise and do not reflect the interaction network can be 
processed. As an example, the number of clusters for the unidimensional Ising model with ^ = 9, _B = 1000 is larger 
than the one for B — 4500. Another illustration is given by the diluted 7x7 grid with J = 3.2, which requires the 
processing of many clusters of large size {Kmax ~ 8). 

VII. CONCLUSION AND PERSPECTIVES 

In this paper, we have presented an adaptive cluster expansion to infer the interactions between a set of Ising 
variables from the measure of their equilibrium correlations. We have discussed the statistical mechanics of this 
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FIG. 30: Outcome of the inference algorithm for an Erdos-Renyi random graph with A'^ spins, and connectivity d = 5. Values 
of Jij were chosen uniformly at random in [—4; 4], and fields were set to hi = Jij- Left: error ej on the inferred 

couplings as a function of 6 for Af = 50 and B = 1000, 4500, 10000 configurations. Middle: error ej vs. 9 for iV = 50, 100, 200 
and for B — 4500. Right: (top) and K^u (middle) as functions of B; the inferred couplings J'"' are compared to their true 
values in the bottom panel for A'' = 100 and threshold O*. 



expansion, and shown applications of the algorithm to artificial data generated using Ising models on unidimensional 
and bidimensional lattices, as well as on Erdos-Renyi random graphs. 

We have in particular underlined the important conditions on the inverse problem that should be fulfilled for our 
algorithm to be efficient. The essential condition is that the inverse susceptibility, which determines the change of a 
coupling (or a field) resulting from a change in the data (1- or 2-spin frequencies) should be well-conditioned. We 
stress that this property is not incompatible with the presence of a long-range susceptibility. Hence, the inverse 
problem can be easy to handle even in the presence of long-range correlations. As far as our algorithm is concerned, 
this condition entails that the maximal size K^iu of the clusters which need to be taken into account remains small 
even if the correlation length of the system is large. 

The origin of this condition is that our algorithm builds up, by definition, an interaction network defining a well- 
conditioned Ising model. Indeed, in the absence of reference entropy (^o — 0), the cross-entropy >5'(p) is approximated 
through a sum of a restricted number of cluster-entropies, see (pO|) . For sufficiently large thresholds 9, most quadru- 
plets of variables, say, i,j,k,l, do not appear in any selected cluster (of size K > 4); hence, most of the entries 
ix~^)ij,ki of the inverse susceptibility matrix entries vanish according to (j55p . In the presence of the reference entropy 
So = Smf, is not guaranteed to be sparse any more due to the contribution Xq^ = — ^^q^ ■ However, when a 
regularization is introduced, e.g. based on the norm Li (fH|) . the network of interactions (Jo)y is highly diluted, and 
we expect Xo^ well-conditioned, too. Further investigations of this point would be very useful. 

According to the discussion of Section IIV CI inverse problems corresponding to Ising models on finite-dimensional 
lattices are well-conditioned in the perfect sampling limit. The introduction of a threshold over the minimal values 
of cluster-entropies allows us to force the inverse problem to be well-conditioned even in the presence of sampling 
noise. We have checked this statement on inverse problems corresponding to 'critical' Ising models. While the 
correlation length increases with the size of the system, the maximal size of the clusters, Kdu, remains roughly 
constant. Therefore, the computational complexity of the algorithm increases only linearly with the system size. 

An essential feature of inverse problems is that data are generally obtained from a finite sampling and, therefore, 
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FIG. 31: Computational time of the cluster algorithm at threshold O* for the different examples shown Section IVII The 
computational time grows with the number of processed clusters, which depends on the structure of the interaction graph and 
on the number of sampled configurations more than the sole number of variables, A'^. For a fixed maximal size of clusters, 
Kmax, the running time is roughly proportional to the number of clusters. Unless explicitly stated otherwise, the sampling is 
realized in good mixing conditions. Times were measured on one core of a 2.8 GHz Intel Core 2 Quad desktop computer. 



frequencies and pairwise correlations are plagued by sampling noise. Avoiding overfitting is a primary goal for an 
inference algorithm. This goal is achieved, in our algorithm, by the introduction of the threshold 0. As a result most 
of the clusters are discarded, and in particular, those whose contributions would convey very little information about 
the true nature of the underlying interaction network. Fixing the threshold value such that the relative reconstruction 
errors tp and tc are of the order of one corresponds to the maximal accuracy allowed by the quality of the data. 

The cluster expansion introduced here differs from other classical cluster expansions, developed in the contexts of 
the theory of liquids and of computational physics. In particular we do not impose consistency equations for the 
marginal probabilities over the clusters. Our expansion scheme is simpler, and requires only the knowledge of the 
individual and pairwise frequencies of the variables in the cluster. Moreover, the cluster construction and selection 
rules prevents any combinatorial explosion of the computational time. 

Several points would deserve further investigations. Among them the discussion of the convergence properties of 
the expansion, started in Section IIVDI should be expanded and improved. A natural and interesting question is to 
ask how the series behaves when the packets of Fig. 1121 start mixing, i.e. in the presence of a strong sampling noise. 
Another aspect which should be better understood is the influence of the construction rule. Our heuristic consists in 
merging two almost completely overlapping clusters of size K to build a new cluster of size K+ \ (provided its entropy 
is larger than O). This rule has a simple intuitive interpretation, compatible with the notion of interaction path, and 
attempts with other rules have been less fruitful. However, a deeper theoretical understanding and justification is 
clearly needed. Last of all, the a posteriori validation of the method relies on the use of a Monte Carlo simulation 
to calculate ec and Cp. We have tested another procedure to avoid the use of a Monte Carlo calculation, based on a 
partial resummation of the cluster contributions corresponding to the free-energy (at fixed couplings and fields) . This 
procedure, whose applicability goes beyond the inverse problem, will be detailed in a further publication. 

Acknowledgements: We are grateful to J. Barton, J. Lebowitz, E. Speer for very useful and stimulating discus- 
sions, in particular regarding the correspondence between the inverse susceptibility and the direct correlation functions 
and the practical implementation of the inference algorithm. We thank E. Aurell for pointing to us the difference 
between P and Q, see Section IVID II 
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Appendix A: Optimal choice for the regularization parameter 7 



In this Appendix we discuss how the optimal value for the parameter 7 in the L2-regularization (|13p can be 
determined. As explained in Section III Bl the regularization term can be interpreted as a Gaussian prior Pq over the 
couplings. Let us call the variance of this prior. Parameters 7 and are related through 

7P^(1-P)^-^, (Al) 

where we have assumed that the 1-site frequencies pi are uniformly equal to p. To calculate the optimal value for 7, 
or, equivalently, for a^, we start with the case of a single spin for the sake of simplicity, and then turn to the general 
case of more than one spin. 



a. Case of N — 1 spin 



For a unique spin subjected to a field h the likelihood of the set of sampled spin values, {o''^}, is P/Jcr] 



exp(Bp/i)/(l 



Ji\B 



Here p denotes the average value of the spin over the sampled configurations ([2]). We ob- 



tain the a posteriori probability (1151) for the field h given the frequency p, 



post 



[h\p] 



exp(-/iV(2cr2) + Bph~B log(l + e^))/V2^ 



(A2) 



V{p,B,a^) 

where the denominator V{p, B^a^) (marginal Ukehhood) is simply the integral of the numerator over all real-valued 
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FIG. 32: Logarithm of the marginal likelihood (with a minus sign, and divided by the size B of the data set) versus variance 
of the prior distribution of the field. Parameters are B = 100, p = .02. 



For more than one spin calculating the marginal likelihood would be difficult. We thus need an alternative way 
of obtaining the best value for . The idea is to calculate / through a saddle-point method, and include the 
Gaussian corrections which turn out to be crucial. This approach is correct when the size of the data set is large. A 
straightforward calculation leads to 



/~log(H-exp(/i*)) 



-ph* + ^-{h*f + ^\og 



1 + i 



exp(-/i*) 



F (l-fexp(-/i*))2 
T h* ~{). I decreases from I{a^ 



0) 



(A3) 
log 2 



where F = l/{B(j^) and h* denotes the root of (1 -I- exp(— /i*))^^ — p 
with a strong negative slope, dI/da^{Q) ~ —B p^, and increases as logiT^/(2_B) for large values of the variance. 
Expression (IA3|) cannot be distinguished from the logarithm of the true marginal likelihood / shown in Fig. [351 
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Case of N > 2 spins 



The above saddle-point approach can be generahzed to any number N of spins, with the result 

1 / H 

— log dct I IH 

2B ^ V r 



/ - Si,,ng [{h^, J^j}\p] + — log dct ( 1 + — ) (A4) 



where Sising was defined in ([7]) and x is the N + ^N{N — 1) = ^N{N + l)-dimensional Hessian matrix composed 
of the second derivatives of Sising with respect to the couplings and fields ((T(I| . In principle x could be diagonalized 
and the expression (jA4p calculated. However this task would be time-consuming. As we have seen in the previous 
subsection we expect / not to increase too quickly with (for not too small variances) and approximate calculations 
of / can be done under some data-dependent hypothesis. We now give an example of such an approximation, valid in 
the case of multi-electrode recordings of neural cell populations. 

A simplification arises when the number B of configurations and the frequency p are such that: (a) each spin i is 
active {— 1) in a number of configurations much larger than 1 and much smaller than B i.e. 1 <^ B x p B; (b) the 
number n2 of pairs of spins that are never active together is much larger than one and much smaller than ^LL^Lzll^ 
These assumptions are generically true for the applications to neurobiological data. For instance, the recording of the 
activity of iV = 40 salamander retinal ganglion cells in ^ fulfills conditions (a) and (b) for a binning time At = 5 msec: 
a cell i firing at least once in a time-bin corresponds to (Tj = 1, while a silent cell is indicated by Ci = 0. More precisely: 
(a) the least and most active neurons respectively fire 891 and 17,163 times (among B = 636, 000 configurations); (b) 
n2 = 34 pairs of cells (among 780 pairs) are never active together. 

Condition (a) allows us to omit the presence of F in the calculation of the fields, hi ~ logpi, to the first order of a 
large (negative) field expansion. Condition (b) forces us to introduce a non-zero F to calculate the couplings, with 
the result that interactions between pairs i,j of cells not active together are equal to Jij ~ logF -I- 0(loglog(l/F)). 
Finally we obtain the asymptotic scaling of the entropy when F — > 0, 

Sising ^ «2 ^ (l0gF)2 + O (f log F log log ^ ) . (A5) 

We are now left with the calculation of the determinant in (|A4[) . From assumption (b) the number of pairs of neurons 
not spiking together is small with respect to N'^ , meaning that most of the eigenvalues A" of the Hessian matrix of 
Sising are non zero. Hence, 

logdet(l + ^)= Yl log 1 + ^U-^ logF. (A6) 
Putting both contributions to / together we get 

F A^^ 

/(F)c.n2 - (logF)2-_ logF. (A7) 
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The optimal value for the variance is the root of 



(F) = ~ ^ (logF)2 - — - ~ ^ {\ogBf - — aK (A8) 



dr^ ' 2 ' ° ' 45 F 2 

We finally deduce the optimal variance 

2_ ^logi?^' 



For the data described above we find a ~ 8. 



a^^2n2 (^) • (A9) 



Appendix B: Expression of the entropy of clusters with size K = Z 

In this Appendix, we give the analytical expression for the entropy of a cluster with K = ?> spins. Using this 
expression instead of minimizing the cross-entropy ([7]) offers a valuable computational speed-up as there are 0{N^) 
clusters of size K — 2>. We start with the definition of the entropy P{cr)'. 

Ss^- P{^i,<^2,<J3) \ogPiai,a2,a3) . (Bl) 

0-1=0,1 

CT2=0,1 
(73=0,1 
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We then replace the probabihties P{ai,a2, ca) of the eight configurations of the three spins above with their expressions 
in terms of the probabihties {pi,Pki} hi the data, and of the probability P123 that the three spins are equal to 1: 



^^(1,1,1) 


= P123 




P(1,1,0) 


= P12 - P123 




P(1,0,1) 


= Pl3 -Pl23 




P{0,1,1) 


= P23 ~ P123 




P(0,0,1) 


= P3 - P23 - Pl3 + 


P123 


P(0,1,0) 


= P2- P12 - P23 + 


P123 


P(1,0,0) 


^ Pi- Pl3 - P12 + 


J3123 


P(0,0,0) 


= 1 - Pi - P2 - P3 


+ P12 + Pl3 + P23 ~ P123 



(B2) 

The only unknown quantity (not available in p) is the probability pi23- To determine P123 we impose 

^ = , (B3) 
api23 

which means that the three-body coupling J123 vanishes. Condition (|B3[) gives a third degree equation on P123, 

P?23 + aP?23+/3pi23+7 = (B4) 

with 

a = P1P2 +P1P3 + P2 P3 - Pi P23 - P2 P13 - P3 P12 - P12 - P23 - P13 , 

P = P1P23 +P2?'?3 +P3P?2 -P1P2P23 -P1P2P13 -P1P3P12 -P1P3P23 - 

P2 P3 P12 - P2 P3 Pl3 + 2 P12 Pl3 P23 + Pl2 Pl3 + Pl2 P23 + Pl3 P23 + Pi P2 P3 , 
7 = Pi P2 + P3 (1 - J3l - J32 - P3 + P12 + Pl3 + P23) • (B5) 



Upon substitution of P123 in (jBip we obtain the desired cross-entropy S3, as a function of the three average values pi 
and the three two-point averages pki ■ The expression of the cluster-entropy is given by, 

AS'(ij,fe) = S3{pi,Pj,pk,Pij,Pik,Pjk) - AS'(^,j) - AS'(i^fc) - AS'(j_fe) - AS'(,) - AS'(j) - AS(^k) , (B6) 



according to ((28|) . The expressions of the cluster-entropies for one and two spins are given by, respectively, p3|) and 
P5|) . Similarly, one obtains the expressions for the contributions of the 3-spin cluster to the values of the interactions 
parameters by differentiating AS* with respect to the pi's and the pfc;'s. 

Appendix C: Leading diagrammatic contributions to small cluster-entropies 

We analyze the dominant diagrams contributing to the cluster-entropies for the various values of the cluster sizes, 
K, in the limit of small connected correlations Cki- 

1. Case K ^2 

The entropy AS'(jj) of a 2-spin cluster is the sum of all diagrammatic contributions containing two spins and an 
arbitrary number of links between them, corresponding to the power of the expansion parameter My = Cij/{pi (1 — 
Pi)pj (1 —Pj)) (Fig. [3]) and Section UlI C II For small values of My the largest contribution to AS'(ij) is the one with 
three links (cubic power of M^j), if the reference entropy Sq = Smf removes the two-link loop diagram. The entropy 
contribution of this diagram was computed in [30j , with the result 

A^lJ=a.,,(c.,)^ (CI) 

where 

^(3) ^ (2P.-I)(2p, -1) . . 

""^'^ 6{p.y{i~p.,y {p,y{i-p,r ■ ^ 

The superscript 3 refers to the power of the connected correlation. 
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c) j 



^ ^ ^ 
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e) J 





i k i k i k 

FIG. 33: Leading diagrams to the order 5 (top) and 6 (bottom) in the connected correlation for the entropy of 3-cIusters. 



Case K = 3 



For K = 3 the leading term to AiS'(i ^ ,t) in powers of was not derived analytically in [30|- Based on the studies 
of the unidimensional Ising model and the independent spin models (Appendix |F|, we find that the leading diagrams 
are diagrams (a), (b), (c) in Fig. [33] (bold diagrams in Fig. [3]), whose sum is given by 



with 



^^ij,k = ^ijk (Cy)^ {C]kf Ck^ + a)^l {c^,f Cjk {c^f + a)^^ Cy [Cjkf {ck^f 

{2p, - 1) (2pfe - 1) 



ijk 



(C3) 



(C4) 



Note that a\^l, differs from a^ jj.. We have also found the coefficients of the subsequent diagrams, of the order of . 
These diagrams are labelled by (d), (e), (f) in Fig. [321 Their total contribution to the cluster-entropy is 



„(6) 



(6) 



„(6) 



with 



ijk 



{2p, - 1) {2pk - 1) 



3 {p^r (1 - ip,r (1 - p,r iPk? (i - Pk? 



(C5) 
(C6) 



3. Generic case K > A 

The above results for K — Z are easily generalized to any value of the cluster size K > 4,. The diagrammatic 
expansion of a if -spin cluster includes all circuits where pairs of spins are linked together. Each diagram with (one 
or two) links between ii and {I — 1, . . . ^ K — 1) and (one or two) links between ii and ik gives 



AS 



{2K-1) 



2n/=ife,)Mi-K,)^ 

+ (2p»._. - 1) {2p^,_, - 1) {c,,,,,f (c,,,,3)2 



(2p»fc-i - 1) (2p», - 1) (cji,»J^(c,,,,3)2 . . . (cj,_2,,j,_ 



(C7) 



At the next order in power of Mij, each diagram with three links between ii and ij+i (l ~ 1, . . . , iiT — 1) gives a 
contribution 



AS 



(-1)^-^2^,-1) (2k, -1) 



K-l 



3(p,J2 (1 _p,J2 (p,J2 (1 -p,J2n^-l(pJ3 (1 



Y\ {(^il.il + l) 



(C8) 



1=1 
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numerics numerics 

FIG. 34: Comparison of tiie numerical (i-axis) and tlieoretical (j/-axis) values for the entropies of clusters {i,j) (a) and {i,j,k) 
(b). The system is made of A'' = 40 independent spins, with the same pi as in the neural data of Ref. 3; the average value of 
the Pi's is p — 0.0238. Theoretical predictions correspond to (|C3[) and (|C5|) . The number of sampled configurations is -B = 10®. 



Appendix D: Critical correlation length for the absolute convergence 

In this Appendix, we briefly explain why the cluster-entropy series is absolutely convergent if and only if the 
correlation length is smaller than 

^ " (Dl) 



logw 



Here, fl = 2 when the reference entropy is 5*0 = 0, and = 3 when Sq = Smf- Parameter v denotes the number of 
neighbours of a site on the lattice, supposed to be uniform. For instance, v = 2D on a hypercubic lattice in dimension 
D>1. 

Consider a set of K distinct points on the lattice. Let M{L) be the number of closed paths of length L visiting all 
K points. We obviously have N{L) < . Hence, the series 

^JV{L) cxp(-nL/A (D2) 



is convergent if ^ < £,c- Reciprocally, let Lq be the length of the shortest closed path Cq encircling the K points. A 
closed path of length Li + Lq can be built from Cq by attaching a closed loop of length Li to any one of the sites in 
Co. Hence, for L>Lo + 2, N{L) > Lq v^~^°. We deduce that the series (jD2p is divergent if ^ > £,c- 



Appendix E: Distribution of cluster-entropies for the Independent Spin model 

We generate B configurations of N independent spins Ci. Spin i is equal to 1 with probability p and to zero with 
probability 1 — p (for simplicity we assume here that all the frequencies pi are equal to the same value p). The 
empirical connected correlations computed from the B sampled configurations of spins are generally non zero. The 
marginal distribution of Cij is a normal law, with zero mean and standard deviation ()49|) . The largest values of the 
correlations are, for a system with N spins, of the order of 

cff^"" = CB v/IbilV , (El) 

according to extreme value theory. 

We compare in Fig. |34] formulas (jC3p and (jCSp for, respectively, the cluster-entropies ASij and ASijk with numerics 
carried out from randomly sampled configurations. Each pair (i, j) fFig. IMT a)) and triplet k) (Fig. |34tb)) define 
a point, whose coordinates are the numerical and theoretical values of the entropy corresponding to the pair- or 
triplet-cluster. The agreement, for B = 10^ sampled configurations, is excellent due to the small value of cb ~ 2 10~^. 
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Numerics 
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^6x10 -4x10 -2x10 



2x10 4x10 6x10 



AS 



FIG. 35: Theoretical (red) and numerical (black) histograms His for the entropies AS* of 2-spin clusters in a system of 
independent spins and B — l(f configurations. Simulations were done with A*' = 40 spins, with heterogeneous pi ~ .0238, see 
caption of Fig. 1341 



1. Distribution of cluster-entropies for K = 2 

The distribution of the entropy oi K = 2-chisters for a set oi B = 10^ configurations is shown in Fig. [35] To 
derive the analytical expression of the distribution in the N ^ oo hniit, we use the small-correlation formula (jCip for 
^'S'(i.2)j and the fact that the distribution of the connected correlation is Gaussian. As a result, approximating ai^2 
with its average value a obtained by substituting pi and p2 with p in (jC2[) . we obtain 



HjsiAS,, = , (E2) 

3ai/3y2^(^(A5(i,2))2/3 

This distribution is a stretched exponential at infinity, and diverges in zero. Its standard deviation is 

a.s,,,=yi5a(c.)3 = ^^M^ (E3) 

For B — 10^ and p = 0.0238 we obtain that the standard deviation is ~ 2.710"^. Distribution (IE2[) is compared to 
the histogram obtains from numerics in Fig. 1351 The standard deviation and the distribution at small entropies are 
in good agreement. Large values of the correlations ()E1|) give rise to isolated values of AS'(i 2), of the order of 

A5(Y,2f ^ {4\ogNf/' (((A5(,,,-))2))'^' , (E4) 

approximately equal to 1.2 10^^ for N — 40. This value is about twice the largest cluster-entropy observed in Fig. [551 
for one particular realization of the sampled configurations. 



2. Distribution of the cluster-entropies for K — 3 

The leading order contribution to the entropy of a 3-cluster is given by (|C3p . We want to calculate the distribution 
of AS'(]^ 2,3) when the connected correlations are random Gaussian variables, of zero mean and variance (cb)^. 
We neglect the correlations between ci2,ci3,C23, which is legitimate for large B. Let us call x = AS (^1^2,3) /{ceicB)^), 
with a given by ([C4p . and let P{x) be the probability density of x. Though we have not been able to find a closed 
expression for P{x), the asymptotics behavior of P for large or small arguments can characterized analytically. 
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a. Large x behaviour 

The Mellin transform of P [111 is 

3/2 



j dxP{x)x^^{-\ j dci2dci3dc23 e-^^'^i^'^i^'^^^) (E5) 
where 

P(C12,C13,C23) = -^(Ci2 +C?3 +C?3) + A log(ci2C?3C^3 + C?2Cl3C23 + C?2C?3C23) • (E6) 

The tail of P{x) at large x can be studied by considering large values of A. We expect the dominant contribution 
to the multiple integral on the right hand side of (|E5p to come from large correlations. The location of the main 
contribution to the integral is the value of (ci2, C13, C23) which maximizes F . As F is invariant under any permutation 
of its arguments, we look for a maximum where C12 = C13 — C23 = c*. A straigthforward calculation shows that 

c*{X) = ^|l\ F*(A) = ^AlogA + A('log3+^log^-^') . (E7) 



We now use the saddle-point method again, this time to estimate the integral on the left hand side of (IE5[) . We obtain 

max [ log + A log x] = F* ( A) , (E8) 

which is true when A is very large. Hence, F*{X) is the Legendre transform of logP(x). Solving (|E8[) gives 

\ogP{x)^~- (-) (E9) 

at large x. The distribution of the cluster entropies AiS'(i 2,3) thus follows a stretched exponential with exponent |. 
This decay is much slower than an exponential, and leads to large tails as can be seen from Fig. |9l 

b. Small X behavior 

In order for the rescaled entropy x to be small, at least one among the three correlations should be small according 
to (jC3p . Without restriction, we may assume that C12 is the smallest of the three correlations. As C12 appears once 
with power one, and twice with power two in (|C3p . we approximate x ~ ci2cl^C23- The Mellin transform of P is, for 
negative A, 

dxP{x)x^ c,3(J dc-^c^ e--V2j (J dc^c^^e-^^'A . (ElO) 



The largest pole is located in A — — i, and is of order 2. According to standard results on the inversion of Mellin 
transforms [36|, we obtain a precise characterization of the divergence of the probability density at small x, 

- C , (Ell) 

\/x 

where C is a constant. 

c. Typical value of x 

The typical value of the x is defined through 

xtyp — exp dxP{x) logx^ ■ (E12) 
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This quantity is less sensitive than the average value of x to the presence of the long tails in P{x) at large x. We 
write X = (012013023)^ z where 

z^ — + — + —. (E13) 

C12 Ci3 C23 

Taking the logarithm, and averaging over the correlation, we obtain the following expression for the average value of 
the logarithm of x, 

(logx) = 6 ( r dc'^^e-^'A + (logz), . (E14) 



27r 

The integral over c in the above equation can be calculated numerically, with a value ~ —.63518. To calculate the 
average value of logz, we first use the identity 

\ogz^ / — (e^"-e-"") . (E15) 
Jo 

Taking the average on both sides, we have 

/•OO 7 

(logz).= / ^(e-"-(e-"-),) . (E16) 
Jo " 

As z is a sum of independent random variables its Laplace transform is the product of their Laplace transforms. 



-io,o 



(E17) 



where G is the Meijer-G function, we have calculated the integral (jE16l) using the Mathematica software. Some care 
must be taken for the numerical accuracy when z — >■ 0. The outcome is (logz)^ ~ 2.09643. Putting all contributions 
together we obtain Xtyp — 0.18. The corresponding values of AS'(i 2,3) are 3.5 x 10^^^ for B — 10^, and 1.1 x 10~^^ 
for B = 10^, in good agreement with the numerical value, respectively, 3 x 10~^^ and 9 x 10~^^. 



d. Standard deviation of x 

We can easily evaluate the variance of each of the three terms of the sum in (jC3p as the product of the variances 
of the three terms in the product, based on the approximation that the connected correlations Cij are independent 
stochastic variables. We obtain 

_ 3 73(2p-1)^(cb)^ _ 3^/3(2p-l)^ 
'^AS.,fe 2p6(l_p)6 2p(l-p)B5/2 ■ ^ «^ 

With the values of N and p chosen in Fig. |35l we find that the standard deviation is of the order of 10^^'^ for B = 10®, 
and 2 10"" for B = 10^ see Fig. M 



3. Distribution of cluster-entropies for generic K > 4 



In general, for K > 3, the leading contribution to AS'(ij^.i2_...i^) (|C7P contains the sum of if x {K — l)!/2 terms, 
each one being the product of K random variables, among which (K — 1) are elevated to power two, and 1 is elevated 
to power 1 . The factor K comes from the fact that there are K way of choosing the single link in the circuits with K 
spins. The factor {K — l)!/2 is the number of non equivalent circuits going through K spins. We define the rescaled 
entropy x through 
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The approach followed in Section IE 21 to calculate the asymptotic behaviour of the probability density P of x for 
K = 3 can be extended without difficulty to any value oi K > 3. We find that P{x) diverges when a; — ^ 0, with 

P{x) = C t}^^i^ . (E20) 

where C is a constant. Hence the shape of the distribution of x is, up to logarithmic terms, independent of K. On 
the contrary, the tail of the distribution for large x is very sensitive to K, 

K / T \ 2/(2^-1) 

logP{x)^-—±-^ ^) (E21) 



2(X-i)2 \K 

As in the K = 3 case, the distribution of the cluster entropies AS" follows a stretched exponential. The exponent of 
the stretched exponential decreases with K. The variance of the distribution can be easily evaluated, with the result 



2{p{l-p)y 



Appendix F: Properties of the cluster-entropies of the one-dimensional Ising model 

Consider the one-dimensional Ising model with nearest-neighbour couplings and periodic boundary conditions. The 
Hamiltonian of the model is 

H = -h'^cTi - j'^aia^+i , (Fl) 

i i 

where the spins ai take 0,1 values. The parameters of the model are the N identical fields hi = h, the N couplings 
Ji^i+i — J between neighbours and the remaining N x {N — 3)/2 zero couplings J^j = between non neighbours. 
We recall a few elementary facts about the model. The transfer matrix is 

,J+/i h/2 



T=[ >2 \ ) • (F2) 



The eigenvalues are \± — \ (^e'-'^^ + 1 ± ■\/ {e-^+^ — 1)2+4 e''^ , and the two components of the eigenvectors are, 

respectively, u±(l) = -(1 - \±)/yJe^ + (1 - A±)^ and v±{2) = e'^/'^/^e^ + (1 - >^±Y- The probability that a spin is 
up is given by, in the N ^ oo limit, 

p^{a,)j^{v+il))\ (F3) 



and the connected correlation at distance d is 

a^i+d = {(Ti<Ti+d)j - (CTi)j(cr,;+d)j =p{l-p) ( ^ ) =p(l -p)cxp(-d/^) , (F4) 
where the correlation length is given by ^ — 1/ log(A_/A+). 



A+ 



1. Calculation of the cluster-entropies and cancellation property 

In this Section, we show the exact cancellation property between the entropies of clusters with different sizes 
discussed in Section IIII C 21 We will see that this property is a direct consequence of the existence of a unique 
interaction path along the unidimensional chain. 

a. Case So = 

We first consider the case where the reference entropy is zero. Let F = (ji,i2, • • • ,iK) be a cluster of size K, with 
ii < 12 < ... < ix- Due to the unidimensional nature of the interactions, the Gibbs distribution over the ii'-spin 
configurations cr obeys the chain rule. 
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where P{-,-) P{-\-) denote, respectively, joint and conditional probabilites. Inserting the above formula into 
expression ^ for the cross-entropy, we obtain 



5'/^mg(J|p) = -^Poh..[cr]f ^ l0gPj(cr,,^JcriJ-|-l0gPj(cri2,Cr,J j 

/K-l K-l . 

= - ^ Fobs W] ( log ("H+i ' ^0 - E log P' (^^' ) ) 



1=2 

K-l K-l 

IT ^ ;=i 1=2 ' 

K-l K-l 
1 = 1 1=2 

Each variable CTi, , with I — 2, . . . , K — 1, appears three times in (jF6p . which explains the presence of three fields h 
with the same index ii. After optimization over J = ({Ji,, 1,4.1}, {'li^}, {'^-t^li {''■i,}) s-l^ these fields are equal, and we 
obtain 

K-l K-l K-l K-l 

S{p) = E - E ^iP^') = E ^%,^,+i)(P) + E ^%)(P) ■ (F7) 

;=i 1=2 1=1 1=2 

Hence the cross-entroy S'(p) is the sum of the 1-cluster entropies and of the entropies of the 2-clusters made of 
adjacent sites. None of the other cluster-entropies appear, which proves that they cancel each other. To illustrate the 
cancellation mechanism, consider the case X = 3. According to (jF7|, 



5(p) = A5(,,,,,)(p) + A%^,3)(p) + A5(,,)(p) + A%)(p) + A5(,3)(p) . (F8) 
Comparing with (|28p we obtain 

A%,,„,3)(p) = -A%,,3)(p) , (F9) 

which shows that the entropy of a 3-cluster and the one of a 2-cluster with the same extremities 11,13 are opposite 
to one another. By a recursive applications of (jF7P this result can be immediately generalized to higher values of K . 
The entropy of a if-cluster is simply the entropy of the 2-cluster with the same extremities, multiplied by (—1)^"^. 
Hence, identity p4|) is established. 

According to formula (|F4p for the connected correlation, the entropy of a two-site cluster is a function of the distance 
d between the two sites: 

A%,+rf) =F(exp(-d/0) , (FIO) 

where 

F{u) = -2p(l - p)(l - u) log(l -u)- p{p + (1 - p)u) log (1 + ~ ^^" ) 

- (l-p)(l-p + pw)log(l + ^) . (Fll) 

To obtain the expression (jFlip for F, we have used formula (|26p for the 2-spin cluster-entropy, with pi = P2 = P and 
P12 = p'^ + C12, where the correlation C12 is given by (jF4l) . Note that F{u) = 0{u'^) for small u, in agreement with 
scaling (|iT|) . 

b. Case So = Smf 

We now introduce the reference entropy = Smf- The matrix M defined in (I2ip has elements 



M,, = ^_^=== = exp(-|* - . (F12) 



\/pi{l~Pi)pj{l-Pj) 
The inverse of M, G = M~^, is a tridiagonal matrix, whose non zero elements are 



_l + exp(-2/e) _ exp(-l/0 

1 - exp(-2/0 1 - exp(-2/^) 
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6<d<16 



(a) 



(b) 



5 <d< 15 
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FIG. 36: Histograms of the entropies for clusters of size 2 (a) and 3 (b); in the latter case, entropies are negative. Data are 
generated from the unidimensional Ising model (|F1|I with N = 30 spins, and parameters J — 4 and h = —6. Each peak is 
labelled by the distance d between the extremities of the clusters. The reference entropy is 5*0 — Smf- 



Consider now the Gaussian model over N real- values variables ipi, whose energy function is given by 

For this Gaussian model, the logarithm of the partition function is (up an irrelevant additional constant), logZ[G] = 
— ilogdetG. By construction, model (jF14p is the solution of the inverse Gaussian problem, with data: {(pi) = 
0, {(pi(pj) = Mij. Hence, 5*0 can be interpreted as the cross-entropy of Gaussian model (|F14p under those data. A 
key feature of the Gaussian model above is that its interaction matrix is tridiagonal. Only nearest neighbour 
variables are coupled to each other according to (jF13p . We conclude that the Gaussian model is a one-dimensional 
model. Consequently, it obeys a chain rule similar to (jFSp . This is the only requirement for the main conclusion of 
Section IF 1 al to hold: in the cluster expansion of Sq, the entropy of a if -cluster is simply equal to the entropy of the 
2-cluster with the same extremities, multiplied by (— As both the expansions of S and the one of enjoy this 
property, so does the expansion oi S — Sq. 

We conclude this Section by the expression of the 2-cluster entropy AS^^ci)- In the presence of the reference 
entropy Sq = Smf, we substract the following contribution to expression (|F10|) . see (PT|) . 

Hence, function F{u) defined in (jFlip should be substracted ^ log(l — u^). It is a simple check that F{u) — ^ log(l — 
u^) = 0{u^), in agreement with scaling (|46l) . 



2. Examples and calculation of diagrammatic coefficients 



We now show the histograms of cluster-entropies for K — 2 and if = 3 for specific choices of J,h. The averages pi 
and pij were calculated exactly through formulas (|F3p and (|F4I) (perfect sampling). Figure [36la) shows the histogram 
of entropies for clusters of the type + d). Entropy values are discrete and labelled by the distance d. They range 
from 10^^ (for nearest neighbours, distance d = 1) to values smaller than 10~^^ for 6 < d < 15. All entropies smaller 
than the numerical accuracy ~ 10~^^ are put in the peak at the origin. Expanding F{u) to the lowest order in u (for 
So — Smf) we find the asymptotic formula for the 2-cluster entropy: 

A5M+.^i^^r^e-3^/«, (F16) 
6p{l-p) 

in agreement with (|Cip . We have verified that this formula is in very good agreement with the numerics as soon as 
d > 4 for the parameters of Fig. [36l 

Figure [5CTb) show the histogram of the entropies of 3-clusters k). Let d — k — i he the distance between the 
extremities. We observe that the entropies are gathered into peaks, and are exactly the opposite of the ones found in 
Fig. IMT a) as expected. Two differences are: 
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• The peak in d = 1 is not present because the minimal distance between three spins is d = 2. The largest 3-spins 
entropy thus corresponds to triplets of the type (i, j + 1, « + 2). 

• The height of the peak (number of clusters) corresponding to distance d is {d — 1) N. The degeneracy {d — 1) 
is the number of ways of choosing the location of site 12 in between ii and i^. 

We now show how the value of the cluster entropy can be found back from the leading terms in the diagrammatic 
expansion calculated in Section fC 21 Let us call d' — j ~ i < d the distance between the first two sites in the cluster. 
For each diagram in Fig. [33] we give in Table IF^ the sum of the distances of its links, i.e. the power of exp(— 1/^). 



diagram 


sum of distances 


a) 


d + 2d' + 2{d-d') = 3d 


b) 


2d + 2d' + {d- d') = 3d + d' 


c) 


2d + d' + 2{d - d') =4d-d' 


d) 


3d' + 3{d - d') = 3d 


e) 


3d + 3d' 


f) 


3d + 3{d-d') =6d-3d' 



Interestingly, the lowest total distances are found in diagrams a) and d), while the latter diagram is of a higher 
power (6) in terms of the correlated function than the former (5). Hence, contrary to the case of independent spins 
(Section |E|, diagrams a) and d) give the dominant contributions to the entropy. Summing the contributions of a) 
and d) we find 



AS*, 



3d 



(F17) 



To derive the coefhcients a^^'^ and a^^\ we impose that AS'j^i ^ fc) is the opposite of (jF16p . We deduce that a'^^) and 
a^^^ are given by, respectively, (|C4[) and (|C6p . 

The exact cancellation property discussed above has important consequences for the inferred fields and couplings. 
Consider for instance the coupling which vanishes in the ID-Ising model with nearest- neighbour interactions 

(|Fip . As the connected correlation Ci^i^2 is not equal to zero, a contribution to the coupling will be collected from 
the cluster (z, i + 2) itself, equal to 



AJ, 



OAS, 



i,i+2;{i,i+2) 



{i,i+2) 



Other contributions will come from larger clusters. For instance the cluster + l,i + 2) will give an additional 



(F18) 



AJ, 



OAS, 



i,i+2;{i,i+l,i+2) 



{i,i+l,i+2) 



+2 



(F19) 



The sum of the two contributions above vanishes due to the cancellation property. It can be checked that the 
contributions coming from all the other clusters vanish, too, which makes the coupling Ji.i+2 = as it should. 



Appendix G: Inverse susceptibility matrix for the unidimensional Ising model 

Hereafter, we want to invert the matrix x, whose elements are given in (j6ip . The matrix is of dimension ^N{N ~ 1), 
and each element is labelled by two indices and (k, I), with i < j and k < I. Each index can be represented 
by a site of coordinates i and j on the half-grid of Fig. ISTT aV We now show that the non-zero entries of the 
inverse susceptibility matrix, are in one-to-one correspondence with the sites and {k,l) that are 

either identical, or nearest neighbours, or diagonally opposed on the elementary mesh of the half-grid (Fig. ISTT b.c.d)). 
Depending on the value of the difference j — i, the number of those sites is equal to 9, 8, or 6. 

We start with the case j — i > 3 (Fig. ETTb)). By symmetry, the nine unknown matrix elements (x^^) ■ ■ i., take 

only three independent values, denoted by 7 for {k,l) — /? for {k,l) and (i,j) nearest neighbours, and a for 

{k, I) — (i ± 1, j ± 1). We now write the matrix inversion identity, 

{X~^)ij^kl XkLmn = <5j,m Sj^n , (Gl) 

k<l 



55 



lOp OOOOOOOOO 

- oooooooo 
8-0000000 

- o o o o o o 
6 - o o o o o 

J - o o o o 



4 - o o o 
- o o 



(a) 



lOp oo###oooo 

- 00#J8;#000 
8-0 o ••• o o 

- o o o o o o 
6 - o o o o o 

J - o o o o 



4 - o o o 
- o o 



4 6 

i 



(b) 



lOp OOOOOOOOO 

- oooooooo 
8-0000000 

- o o # # # o 
6- o o • X • 

j o o • • 



4 - o o o 

- o o 
2- o 



4 6 

i 



(c) 



OOOOOOOOO 

oooooooo 

O O O O o o o 

o o o o o o 

o o • • • 

o o # K 

o o # 

o o 

o 



(d) 



FIG. 37: Half grid representing the index of the entries of the inverse susceptibihty matrix, with i < j (a). Black circles 
locate the nearest-neighbours and the diagonally opposed sites {k,l) of (cross), with i = 4 and j = 9 (b), 6 (c), 5 (d). 



for various values of (m, n). Let d = j — i. For m =^ i,n = j , constraint (IGll) gives 

^(1 _ ^2d) ^ 2/3(2x - x^''-! - x^'^+i) + a(4x2 - x^^-^ - x^" ~ x^^+^) = 1 , (G2) 
which should hold for all d > 3. We deduce two coupled equations for the three unknown variables: 

7 + 2(a; + ^)/3 + 4(x+i)^a = , (G3) 

7 + 4j;/3 + 4a;2a = 1 . (G4) 

For m = i + 1, n = j, constraint (|Gip is equivalent to 

-f[x - x""-^) + /?(! + ix" - x""-^ - ix"") + a{2x + 2x^ - x"""^ - Ix^"-^ - x^''^^) = . (G5) 

The d-dependent term in the equation above cancels by virtue of (IG3I) . We are left with an additional equation over 
a, ^,7: 

7x + /3(l + 3a;2) + 2a;(l + a;2)a = . (G6) 

By symmetry of the matrices x^^, no new constraint is obtained when the values of m, n are further varied. Solving 
(lUSl) . (lG4l) . (lG6l) we obtain 

P = " n I2T2- ' T = 7^ I2T2 ■ 



(l-a;2)2 ' ^ (l-a;2)2 ' ' (1 - a;2)2 

The analysis of the other cases j — i + 2 (Fig. ISW c)) and j = i + 1 (Fig. ISTT d)) can be done along the same lines. 
We do not write the calculations in details, and simply report the results. The case j = i + 2 is very similar to the 
previous case. There are 8 coefficients to be calculated, with three independent values, a',/3',7'. It turns out that 

a'=a, P' = P, 7' = 7- (G8) 

As for the last case, j = i+l, we call a" the values of the entries of with {k,l) = (i— 1, j— 1), (i— 1, j+l), (i+1, j+l), 
(3" the values of the entries with (fc, /) = (i — 1, j), (i, j + l), and 7" the diagonal element corresponding to (fc, /) = («, j). 
After some elementary algebra, we find 

a=a, /3=/3, 7 = ^-^ _ ■ (*^^) 



All those results are reported in (162 
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